AD-A241  225 


REPORT DOCUMENTATION PAGE

3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release; distribution is unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S): MIT/LCS/TR 514
5. MONITORING ORGANIZATION REPORT NUMBER(S): N00014-89-J-1988
6a. NAME OF PERFORMING ORGANIZATION: MIT Lab for Computer Science
6c. ADDRESS (City, State, and ZIP Code): 545 Technology Square, Cambridge, MA 02139
7b. ADDRESS (City, State, and ZIP Code): Information Systems Program, Arlington, VA 22217
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: DARPA/DOD
8c. ADDRESS (City, State, and ZIP Code): 1400 Wilson Blvd., Arlington, VA 22117
11. TITLE (Include Security Classification): Algorithms for Scheduling and Network Problems
12. PERSONAL AUTHOR(S): Joel Martin Wein
13a. TYPE OF REPORT: Technical
14. DATE OF REPORT: August 1991
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number):
Scheduling, combinatorial optimization, network algorithms,
parallel computation, connection machine, approximation

19. ABSTRACT (Continue on reverse if necessary and identify by block number)

In this thesis we develop algorithms for two basic classes of problems in
combinatorial optimization: deterministic machine scheduling and network
optimization.

In the first part of the thesis we consider approximation algorithms for
two basic scheduling environments: shop scheduling and parallel machine
scheduling. We give approximation algorithms for shop scheduling that
significantly improve upon the performance of previous algorithms. We then
study on-line approximation algorithms for parallel machine scheduling.

In the second part of the thesis we present several theoretical and
practical results about parallel algorithms for network optimization problems,
including

20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: □ UNCLASSIFIED/UNLIMITED □ SAME AS RPT. □ DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL:
22b. TELEPHONE (Include Area Code): (617) 253-5894
22c. OFFICE SYMBOL:

DD FORM 1473, 84 MAR. 83 APR edition may be used until exhausted. All other editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE: Unclassified



•  A Las Vegas RNC algorithm for the minimum-weight perfect matching
problem when the weights are input in unary.

•  A proof that approximating the value of the minimum-cost maximum
flow is P-Complete.

•  An  experimental  study  of  implementations  of  algorithms  to  solve  the 
dense  assignment  problem  on  a  real  massively  parallel  computer,  the 
Connection  Machine  CM-2. 



Algorithms  for  Scheduling  and  Network  Problems 

by 

Joel  Martin  Wein 

A.B.,  Applied  Mathematics 
Harvard  University 
(1985) 


Submitted  to  the  Department  of  Mathematics 
in  partial  fulfillment  of  the  requirements  for  the  degree  of 

Doctor  of  Philosophy 

at  the 

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
September 1991

©  Massachusetts  Institute  of  Technology  1991 


Signature  of  Author. 


Department  of  Mathematics 
August  9,  1991 


Certified  by. 


David  B.  Shmoys 

Assistant  Professor  of  Industrial  Engineering  and  Operations  Research, 

Cornell  University 
Thesis  Supervisor 


Accepted  by. 


Alar  Toomre 
Chairman,  Applied  Math  Committee 


Accepted  by. 


Sigurdur  Helgason 


Algorithms  for  Scheduling  and  Network  Problems 

by 

Joel  Martin  Wein 


Submitted  to  the  Department  of  Mathematics 
on  August  9,  1991, 

in  partial  fulfillment  of  the  requirements  for  the  degree  of 
Doctor  of  Philosophy 


Abstract 

In  this  thesis  we  consider  the  development  of  algorithms  for  two  very  basic  classes  of  problems  in 
combinatorial  optimization:  deterministic  machine  scheduling  and  network  optimization.  Most 
of the scheduling problems we consider are NP-complete and therefore we study approximation
algorithms for these problems. The network problems are all known to be polynomial-time
solvable; we will be interested in parallel algorithms for these problems. An important
connection  between  the  two  topics  of  this  thesis  is  that  the  scheduling  problems  we  consider 
are  related  to  problems  that  arise  in  the  design  of  parallel  computers;  they  are  also,  however, 
basic  building  blocks  in  the  theory  of  combinatorial  scheduling. 

We  consider  the  two  basic  environments  for  scheduling  a  set  of  machines:  shop  scheduling 
and  parallel  machine  scheduling.  In  the  shop  scheduling  problem  we  are  given  m  machines 
and  n  jobs;  a  job  consists  of  an  ordered  set  of  operations,  each  of  which  must  be  processed 
on  a  specified  machine.  The  aim  is  to  complete  all  jobs  as  quickly  as  possible.  This  problem 
is strongly NP-hard even for very restrictive special cases, and very little was known about
approximation algorithms for it. We give the first randomized and deterministic polynomial-time
approximation algorithms that yield polylogarithmic approximations to the optimal length
schedule.  We  also  give  the  first  parallel  approximation  algorithms  for  shop  scheduling.  Most 
of  our  results  apply  to  the  important  generalization  in  which  there  are  m'  types  of  machines, 
a  specified  number  of  machines  of  each  type,  and  each  operation  must  be  processed  on  one  of 
the  machines  of  a  specified  type. 

In  the  parallel  machine  environment  a  job  consists  of  one  operation  which  can  be  processed 
on  any  of  the  machines.  Most  of  the  basic  questions  about  approximation  algorithms  for  these 
problems  have  already  been  answered,  but  the  algorithms  that  have  been  developed  are  typically 
off-line,  meaning  that  they  must  be  given  the  entire  specification  of  a  problem  instance  before 
they  begin  to  construct  a  schedule.  This  does  not  model  many  real-world  problems,  including 
scheduling  of  jobs  on  a  multiprocessor.  We  study  the  problem  of  scheduling  jobs  on  parallel 
machines  in  an  on-line  fashion,  where  the  existence  of  a  job  is  not  known  until  an  unknown 
release  date,  and  the  processing  requirement  of  a  job  is  not  known  until  the  job  is  processed  to 
completion.  Despite  this  lack  of  knowledge  of  the  future,  we  wish  to  schedule  so  as  to  minimize 
the  completion  time  of  the  entire  set  of  jobs. 

We give two rather general techniques to convert algorithms that require more knowledge
about the input data into algorithms that need less advance knowledge. As a result we are able
to  give  on-line  approximation  algorithms  for  all  of  the  fundamental  parallel  machine  models. 
In  most  of  these  models  we  are  able  to  show  that  our  algorithms  are  asymptotically  optimal. 

In  the  second  part  of  this  thesis  we  consider  both  theoretical  and  practical  issues  in  the  design 
of  parallel  algorithms  for  network  optimization  problems.  We  first  consider  the  minimum  weight 
perfect matching problem, where the weights are input in unary. All previous RNC algorithms
for this problem were Monte-Carlo algorithms, which produced a correct solution with high
probability but gave no indication when they failed. We give a Las Vegas RNC algorithm for
the  problem  -  one  that  certifies,  with  high  probability,  that  its  solution  is  correct  -  utilizing 
reductions  between  minimum  weight  perfect  matching  and  the  T-join  problem.  We  also  show 
how  to  apply  the  technique  to  a  number  of  other  combinatorial  problems. 

We  then  consider  the  problem  of  finding  a  minimum-cost  maximum  flow  in  a  network.  This 
problem is known to be P-complete, and therefore it is expected that there exists no NC
algorithm for this problem. We prove that even approximating the value of the minimum-cost
maximum flow is P-complete.

Finally  we  present  an  experimental  study  of  implementations  of  algorithms  to  solve  the 
dense assignment problem on a real massively parallel computer, the Connection Machine CM-2.
We describe the implementations of five different algorithms for the problem, including
a  new  hybrid  approach  that  we  developed,  and  discuss  other  approaches  which  we  did  not 
implement.  We  evaluate  the  implementations  empirically;  the  best  proves  to  be  the  hybrid 
auction  algorithm. 

Keywords: Scheduling, Combinatorial Optimization, Network Algorithms, Parallel Computation,
Connection Machine, Approximation Algorithms, On-line Algorithms.

In many areas of human endeavor the problem arises of choosing the “best” of a large set of
possibilities.  For  example,  an  employer  may  want  to  choose  a  schedule  for  her  employees  that 
maximizes  their  productivity,  a  trucking  company  may  want  to  find  the  least  expensive  way  to 
transport  goods  across  the  country,  or  a  city  planner  may  want  to  find  the  optimal  places  in 
which  to  situate  firehouses  in  order  to  maximize  the  safety  of  the  city.  These  are  all  optimization 
problems:  problems  which  require  the  optimization  of  a  function  of  some  set  of  variables,  subject 
to  certain  constraints  on  the  values  of  the  variables.  Combinatorial  optimization  refers  to  the 
class  of  optimization  problems  that  require  the  choice  of  one  solution  from  a  finite  set  of  discrete 
solutions,  where  the  constraints  and  the  variables  typically  describe  the  combinatorial  structure 
of  the  problem. 

There is a rich and longstanding relationship between combinatorial optimization and computer
science. The development of efficient algorithms to solve basic problems in combinatorial
optimization has been a major component in the growth of the theory of algorithms. In addition,
a number of the resource allocation problems that arise in the design and control of
computer  systems  can  be  modeled  as  problems  in  combinatorial  optimization. 

The  introduction  and  growing  popularity  over  the  last  decade  of  parallel  computers  has 
motivated  a  large  and  important  group  of  problems  in  both  of  these  categories.  There  are  a 
variety of large optimization problems that it would be desirable to solve in real time. One can
imagine using optimization techniques to direct traffic in a congested city during rush hour, or
to coordinate a large fleet of service people to respond to requests, or to dynamically schedule
airplanes so as to avoid collisions. It is therefore important to understand on what sorts of
optimization problems parallel computing will yield significant speedups and on which it will
not.

Furthermore,  the  design  and  control  questions  about  parallel  computing  systems  are  quite 
different  than  those  about  sequential  machines,  since  they  require  the  allocation  of  tasks  and 
resources  to  a  large  number  of  separate  processors  or  processes,  in  contrast  to  just  one  or  a  few. 
Therefore,  questions  about  the  scheduling  of  or  the  allocation  of  resources  to  multiple  machines 
take  on  a  much  greater  importance. 

In  this  thesis  we  will  explore  both  aspects  of  the  relationship  between  parallel  computation 
and  combinatorial  optimization  by  studying  basic  algorithmic  questions  in  two  fundamental 
areas  of  combinatorial  optimization:  scheduling  theory  and  network  optimization.  In  the  first 
section we will study approximation algorithms for the two classic models of scheduling a set
of  machines:  the  parallel  machine  model  and  the  shop  model.  These  are  in  and  of  themselves 
basic  problems  in  combinatorial  optimization,  but  are  particularly  appealing  to  us  because  they 
bear  some  relationship  to  scheduling  questions  about  parallel  computers.  The  former  is  closely 
related  to  parallel  computation  since  it  models  the  scheduling  of  tasks  on  a  multiprocessor 
[8,  93],  whereas  the  latter  has  found  applications  to  packet  routing  in  parallel  and  distributed 
networks [25, 82, 88].

Before our work relatively little was known about approximation algorithms for shop scheduling.
In Chapter 2 we give approximation algorithms for several variants of this problem that
achieve  significantly  better  performance  guarantees  than  previous  algorithms.  We  also  give  the 
best  parallel  approximation  algorithms  for  several  of  these  problems. 

In  contrast  to  shop  scheduling,  the  state  of  the  art  in  approximation  algorithms  for  parallel 
machine  scheduling  prior  to  our  work  was  excellent.  Almost  all  of  the  known  approximation 
algorithms,  however,  were  off-line  algorithms,  in  that  they  required  the  entire  specification  of 
the  problem  in  advance.  This  situation  does  not  reflect  that  of  many  real  world  scheduling 
problems,  since  often  the  scheduler  does  not  have  in  advance  complete  knowledge  of  a  job’s 
running  time,  or  of  what  jobs  will  be  created  and  require  processing  in  the  future.  In  Chapter  3 
we  study  on-line  algorithms  for  scheduling  parallel  machines,  giving  matching  or  near-matching 
upper  and  lower  bounds  on  how  well  an  on-line  scheduler  can  perform  in  each  of  the  fundamental 
models  of  parallel  machine  scheduling. 

In the second section of this thesis we turn to the design and analysis of parallel algorithms
for problems in network optimization. We focus on two basic problems: that of finding a
maximum  weight  matching  in  a  graph,  and  that  of  finding  a  minimum  cost  flow  in  a  network. 

Both the minimum-cost flow problem and the simpler maximum-flow problem are P-complete
problems, and therefore it is widely believed that there are no “fast” parallel algorithms to solve
them. By “fast” we mean algorithms that, in a PRAM model of computation with a number
of processors polynomial in the size of the input, solve the problem in worst-case time
polylogarithmic in the size of the problem. In Chapter 5 we give an interesting separation between
the parallel complexity of these two problems, showing that the minimum-cost maximum flow
problem cannot be approximated in NC unless P = NC. In contrast, it is known that the
maximum flow problem can be approximated arbitrarily closely in RNC.

In  Chapters  6  and  7  we  consider  both  theoretical  and  practical  issues  in  the  parallel  solution 
of  weighted  matching  problems.  Theoretically  “fast”  randomized  parallel  algorithms  for  the 
minimum  weight  perfect  matching  problem,  when  the  weights  are  input  in  unary,  were  given 
by  Karp,  Upfal  and  Wigderson  and  by  Mulmuley,  Vazirani  and  Vazirani.  Those  algorithms 
produced  a  correct  solution  with  high  probability,  but  could  not  distinguish  between  success 
and  failure.  In  Chapter  6  we  give  a  Las  Vegas  algorithm  for  the  problem,  one  that  with  high 
probability  produces  a  correct  solution  and  otherwise  indicates  it  has  failed.  The  technique  is 
fairly  general  and  has  applications  to  a  number  of  other  combinatorial  problems. 

The  assignment  problem  is  the  special  case  of  the  minimum-weight  perfect  matching  problem 
when the graph is bipartite. In Chapter 7 we study the design and implementation of algorithms
to solve this problem on the Connection Machine CM-2, a massively parallel computer. We
describe the implementations of five different algorithms and evaluate their performance
experimentally. The best proved to be a new hybrid approach which we developed that is able to
take  advantage  of  two  different  levels  of  the  parallelism  of  the  Connection  Machine.  We  also 
discuss  and  attempt  to  evaluate  the  other  approaches  which  we  did  not  implement. 

Chapter  1  of  this  thesis  gives  an  introduction  to  scheduling  theory,  the  scheduling  models 
we  will  be  considering,  and  our  results,  which  are  presented  in  the  following  two  chapters. 
Similarly  Chapter  4  gives  an  introduction  to  parallel  network  optimization  and  our  results, 
which  are  presented  in  Chapters  5,  6  and  7. 


Chapter  1 


Scheduling  Theory:  An  Introduction 


Scheduling  theory  is  concerned  with  the  optimal  allocation  of  scarce  resources  to  activities  over 
time [81]. The subject has been of interest to the human race since the first time two human
beings  wished  to  use  the  same  resource  and  chose  to  settle  the  matter  without  bloodshed. 
More  recently,  the  industrial  revolution  and  the  invention  of  the  computer  have  generated  a 
huge  assortment  of  scheduling  problems,  arising  in  areas  such  as  manufacturing,  production 
planning  and  computer  control. 

The  most  studied  area  in  this  discipline  has  been  deterministic  machine  scheduling.  By 
deterministic  we  mean  that  all  of  the  information  that  defines  a  problem  instance  is  determined 
with  certainty  in  advance.  By  machine  scheduling  we  mean  that  the  resource  to  be  scheduled 
is  either  one  or  a  set  of  machines,  where  a  machine  can  perform  at  most  one  activity  at  any 
time.  There  are  a  staggering  number  of  different  problems  in  this  field:  an  expert  system  for 
the  classification  of  these  problems  recognizes  4,536  different  scheduling  problems,  of  which 
3,817 have been proved NP-hard, 416 are known to be solvable in polynomial time, and 303
are  still  open  [78]. 

Despite  its  size,  this  huge  collection  of  problems  can  be  neatly  organized  according  to  the 
following  three  criteria:  the  machine  environment,  the  job  characteristics  and  the  optimality 
criterion.  An  instance  of  a  problem  is  specified  by  a  machine  environment,  a  set  of  jobs  that 
may  have  some  specified  characteristics,  and  an  optimality  criterion.  Each  job  in  the  set  has 
specific processing requirements on one or several of the machines; the goal is to produce a
schedule of jobs on machines that achieves the optimal value of the optimality criterion.

The  different  types  of  machine  environments  can  be  classified  into  three  basic  categories: 
the  single  machine  environment,  the  parallel  machine  environment  and  the  shop  environment. 
In  this  thesis  we  will  focus  on  the  latter  two  environments.  This  focus  is  to  a  large  extent 
motivated  by  the  connections  between  these  environments  and  parallel  processing.  We  will  also 
focus  almost  exclusively  on  the  optimality  criterion  of  “completion  time”;  i.e.  the  goal  of  the 
scheduler  will  be  to  produce  the  “shortest”  schedule  -  a  schedule  that  minimizes  the  completion 
time  of  the  last  job  to  complete.  Other  typical  criteria  that  have  been  considered  in  the  literature 
are  the  sum  of  the  completion  times  of  the  jobs,  the  weighted  sum  of  the  completion  times  of 
the  jobs,  and  the  maximum  lateness  of  a  job,  where  each  job  has  an  associated  due  date. 

Approximation Algorithms: Most of the scheduling problems we will be considering are
NP-complete, and therefore we will focus on obtaining polynomial-time algorithms for these
problems that give good approximations to the optimal solution. We will evaluate an
approximation algorithm in terms of its performance guarantee, or in other words, its worst-case
relative error. Let $C^*_{\max}$ be the maximum completion time of a job in the optimal solution. If
a polynomial-time algorithm always delivers a solution of maximum completion time at most
$\rho C^*_{\max}$, then we shall call it a $\rho$-approximation algorithm.

1.1  The  Shop  Environment 

Introduction 

Shop  scheduling  refers  to  a  large  class  of  problems  that  typically  arise  in  a  shop,  factory  or 
assembly  line  setting.  The  shop  has  m  machines,  and  in  the  basic  environment  each  machine 
is different and performs a different function. Each job consists of a set of operations, each of
which  must  be  processed  on  a  particular  machine;  a  job  may  have  more  than  one  operation  on 
a  particular  machine.  We  wish  to  produce  a  schedule  which  assigns  a  period  of  time  to  each 
operation  during  which  it  is  processed  on  the  appropriate  machine.  The  goal  is  to  minimize 
the  completion  time  of  the  last  operation  to  complete,  while  ensuring  that  no  more  than  one 
operation  is  assigned  to  a  machine  at  any  point  in  time  and  no  two  operations  of  the  same  job 
are scheduled simultaneously.

A variety of constraints may be introduced on the order of execution of the operations of
the  job,  and  different  sorts  of  constraints  yield  different  well-known  versions  of  the  problem. 
(Note  that  we  only  focus  on  order  constraints  between  the  operations  of  each  job,  and  not 
between  operations  of  different  jobs.)  For  example,  if  we  impose  a  strict  total  order  on  the 
order  of  execution  of  the  operations  of  a  job,  the  problem  is  a  job  shop  scheduling  problem.  If 
the  total  order  is  the  same  total  order  for  every  job,  and  each  job  has  at  most  one  operation  on 
each  machine,  we  have  a  flow  shop  scheduling  problem.  If  there  is  no  order  at  all  imposed  on 
the  execution  of  any  job’s  operations,  we  have  an  open  shop  problem.  It  is  traditional  in  the 
scheduling  literature  to  focus,  for  the  open  shop  problem,  on  the  case  when  each  job  is  processed 
on  each  machine  at  most  once  (since  operations  on  the  same  machine  can  be  coalesced).  We 
will  refer  to  the  general  shop  scheduling  problem  that  does  not  fall  into  one  of  the  three  above 
categories  as  the  dag  shop  problem. 

In  this  thesis  we  will  concentrate  primarily  on  the  job  shop  scheduling  problem,  for  two 
reasons.  First  of  all,  most  of  our  results  for  other  shop  problems  can  be  obtained  as  easy 
corollaries  of  our  results  for  the  job  shop  problem.  Secondly,  the  job  shop  problem  is  probably 
the most famous and most difficult of all the versions of the problem. It is strongly NP-hard;
furthermore, except for the cases when there are two jobs or when there are two machines
and each job has at most two operations, essentially all special cases of this problem are NP-hard,
and typically strongly NP-hard [44, 81]. For example, it is NP-hard even if there are 3
machines, 3 jobs and each operation is of unit length; note that in this case we can think of the
input length as the maximum number of operations in a job, $\mu$.

In  addition  to  this  theoretical  evidence  of  the  difficulty  of  the  job  shop  problem,  it  is  also  one 
of the most notoriously difficult NP-hard optimization problems in terms of practical computation,
with even very small instances being difficult to solve exactly. A striking example of this
is  that  a  single  instance  of  the  problem  involving  only  10  jobs,  10  machines  and  100  operations, 
which  first  appeared  in  a  book  by  Muth  and  Thompson  in  1963,  remained  unsolved  for  23  years 
despite  repeated  attempts  to  find  an  optimal  solution  [81].  Today,  due  to  better  algorithms  and 
faster  machines,  instances  with  10  jobs  and  10  machines  seem  to  be  tractable  -  Applegate  and 
Cook  solved  ten  different  10  x  10  problems,  including  the  notorious  instance  mentioned  above, 
in times ranging from 90 seconds to 42 minutes. (It is interesting to note that the instance of
Muth and Thompson was one of the easier instances to solve using their technique.) However,
slightly larger instances are still currently intractable; they report instances of size 10 x 15,
15 x 20, 15 x 15 and 10 x 20 that they were unable to solve [3].

Formal  Definition  and  Previous  Results 

We formally define the job shop problem as follows. We are given a set $M = \{m_1, m_2, \ldots, m_m\}$
of machines, a set $J = \{J_1, \ldots, J_n\}$ of jobs, and a set $O = \{O_{ij} \mid i = 1, \ldots, \mu_j,\ j = 1, \ldots, n\}$
of operations, where $\kappa_{ij}$ indexes the machine on which operation $O_{ij}$ runs. Thus $m$ is the
number of machines, $n$ is the number of jobs, $\mu_j$ is the number of operations of job $J_j$, and
$\mu = \max_j \mu_j$. $O_{ij}$ is the $i$th operation of $J_j$; it requires processing on a given machine
$m_k \in M$, where $k = \kappa_{ij}$, for an uninterrupted period of a given length $p_{ij}$. (In other words, this
is a non-preemptive model; a model in which operations may be interrupted and resumed at a
later time is called a preemptive model.) Each machine can process at most one operation at a
time, and each job may be processed by at most one machine at a time. If the completion time
of operation $O_{ij}$ is denoted by $C_{ij}$, then the objective is to produce a schedule that minimizes
the maximum completion time, $C_{\max} = \max_{ij} C_{ij}$; the optimal value is denoted by $C^*_{\max}$.

It is possible to extend this model by associating with each job $J_j$ a release date $r_j$, on which
$J_j$ becomes available for processing. This extension does not affect most of our performance
bounds,  and  therefore  unless  explicitly  stated  otherwise  we  assume  that  all  jobs  are  available 
for  processing  at  time  0. 
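
The definition above can be made concrete with a small sketch. The representation below (a job as an ordered list of `(machine, length)` pairs, and a schedule as a table of start times) and the function name `makespan` are illustrative assumptions, not notation from the thesis. The function checks the two feasibility conditions (operations of a job run in order without overlapping, and a machine runs at most one operation at a time) and returns $C_{\max}$. Operation lengths are assumed to be positive.

```python
from typing import Dict, List, Tuple

# Assumed representation: a job is an ordered list of (machine, length) pairs.
Job = List[Tuple[int, int]]


def makespan(jobs: List[Job], start: List[List[int]]) -> int:
    """Return C_max of a non-preemptive schedule, or raise ValueError.

    start[j][i] is the start time of the i-th operation of job j.
    """
    busy: Dict[int, List[Tuple[int, int]]] = {}  # machine -> (start, end) intervals
    c_max = 0
    for j, job in enumerate(jobs):
        prev_end = 0
        for i, (machine, length) in enumerate(job):
            s = start[j][i]
            # Job shop order: operation i may not start before operation i-1 ends.
            if s < prev_end:
                raise ValueError(f"job {j}: operation {i} starts too early")
            prev_end = s + length
            busy.setdefault(machine, []).append((s, prev_end))
            c_max = max(c_max, prev_end)
    # Each machine processes at most one operation at a time.
    for machine, intervals in busy.items():
        intervals.sort()
        for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
            if s2 < e1:
                raise ValueError(f"machine {machine} runs two operations at once")
    return c_max


# Two jobs on two machines (hypothetical data).
jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
print(makespan(jobs, [[0, 3], [0, 3]]))  # 7
```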

The  formal  definition  of  the  flow,  open  or  dag  shop  problems  would  be  almost  the  same, 
except  for  the  following  small  differences: 

•  flow shop: $\kappa_{ij} = \kappa_{ij'}$ for all $i, j, j'$, and $\kappa_{ij} \neq \kappa_{i'j}$ for all $i \neq i'$ and all $j$.

•  open shop: The $O_{ij}$ can be processed in any order.

•  dag shop: For each job $j$ we define a partial order on the $O_{ij}$ and require that they be
processed in any total order consistent with that partial order.
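
As a rough illustration of the flow shop condition, the sketch below tests whether a job shop instance is also a flow shop: every job must visit the same sequence of machines, and no machine may appear twice within a job. The representation of a job as an ordered list of `(machine, length)` pairs and the name `is_flow_shop` are assumptions made for this example.

```python
def is_flow_shop(jobs):
    """True if all jobs visit the same machines in the same order,
    with at most one operation per machine in each job."""
    orders = [[machine for machine, _ in job] for job in jobs]
    first = orders[0]
    no_repeats = len(set(first)) == len(first)
    return no_repeats and all(order == first for order in orders)


print(is_flow_shop([[(0, 3), (1, 2)], [(0, 1), (1, 4)]]))  # True
print(is_flow_shop([[(0, 3), (1, 2)], [(1, 2), (0, 4)]]))  # False
```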

There are two very easy lower bounds on the length of an optimum schedule. Since each job
must be processed, $C^*_{\max}$ must be at least the maximum total length of any job, $\max_{J_j} \sum_i p_{ij}$,
which we shall call the maximum job length of the instance, $P_{\max}$. Furthermore, each machine
must process all of its operations, and so $C^*_{\max}$ must be at least $\max_{m_k} \sum_{\kappa_{ij} = k} p_{ij}$, which we
will call the maximum machine load of the instance, $\Pi_{\max}$. Note that these are lower bounds
regardless of whether we have a job, flow, open or dag shop problem.
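
Both lower bounds are simple sums, as the following sketch shows. It assumes, for illustration only, that a job is an ordered list of `(machine, length)` pairs; the function name `lower_bounds` is hypothetical.

```python
from collections import defaultdict


def lower_bounds(jobs):
    """Return (maximum job length, maximum machine load) of an instance."""
    max_job_length = max(sum(length for _, length in job) for job in jobs)
    load = defaultdict(int)  # machine -> total processing time assigned to it
    for job in jobs:
        for machine, length in job:
            load[machine] += length
    return max_job_length, max(load.values())


jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
print(lower_bounds(jobs))  # (6, 7)
```

For this instance no schedule can finish before time 7, since machine 0 alone must process 3 + 4 = 7 units of work.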

There  has  been  a  tremendous  amount  of  literature  on  shop  scheduling  problems  over  the 
last thirty years [81]. We mentioned earlier that all but the most restrictive versions of the job
shop problem are NP-hard; this is also true of the other versions of the problem. When there
are at least 3 machines both the open and flow shop problems are NP-hard [81]. When there
are just two machines both these problems are known to be in P [66, 50]. In contrast, the two
machine  job  shop  problem  is  only  known  to  be  polynomial-time  solvable  if  each  job  has  at  most 
two  operations,  or  if  each  operation  is  of  unit  size  [81]. 

Despite all the attention, however, surprisingly little has been known about approximation
algorithms for shop scheduling problems. In fact, all that was known was the following
observation by Gonzalez and Sahni:

Theorem  1.1.1  [51]  An  algorithm  A  for  the  job  shop  problem  that  produces  a  schedule  in  which 
at  least  one  machine  is  running  at  any  point  in  time  is  an  m-approximation  algorithm. 

Proof: The length of the schedule produced by such an algorithm, $C_{\max}(A)$, is bounded above
by $\sum_{ij} p_{ij}$, since some operation is always being executed. On the other hand, the average
machine load, $\sum_{ij} p_{ij}/m$, is a lower bound on the maximum machine load, which is itself a lower
bound on $C^*_{\max}$. The theorem follows directly. ■
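
One way to obtain a schedule of the kind the theorem requires is a simple greedy rule: whenever a machine is idle and some job's next operation needs it, start that operation. The sketch below is an illustration under stated assumptions, not the algorithm of [51]: the instance representation (a job as an ordered list of `(machine, length)` pairs), the function name `greedy_schedule`, and the tie-breaking by job index are all hypothetical, and operation lengths are assumed positive. If no machine is running and work remains, every unfinished job has an available next operation whose machine is idle, so the rule always starts something and the schedule never idles completely.

```python
def greedy_schedule(jobs):
    """Greedy non-idling job shop schedule; returns (start times, C_max)."""
    n = len(jobs)
    next_op = [0] * n      # index of each job's next unscheduled operation
    job_free = [0] * n     # time at which each job's latest operation ends
    machine_free = {}      # machine -> time at which it becomes idle
    start = [[0] * len(job) for job in jobs]
    remaining = sum(len(job) for job in jobs)
    t = 0
    while remaining:
        for j, job in enumerate(jobs):
            if next_op[j] < len(job) and job_free[j] <= t:
                machine, length = job[next_op[j]]
                if machine_free.get(machine, 0) <= t:
                    start[j][next_op[j]] = t
                    job_free[j] = machine_free[machine] = t + length
                    next_op[j] += 1
                    remaining -= 1
        # Advance to the next time at which some operation completes.
        pending = [f for f in job_free if f > t]
        if pending:
            t = min(pending)
    return start, max(job_free)


jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
start, c_max = greedy_schedule(jobs)
total = sum(length for job in jobs for _, length in job)
print(start, c_max, total)  # [[0, 3], [0, 3]] 7 11
```

Here the schedule length 7 is at most the total processing time 11, which in turn is at most $m = 2$ times the maximum machine load of 7, matching the argument in the proof.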

Little  was  also  known  in  the  way  of  negative  results  -  results  that  indicate  it  is  difficult 
to approximate these problems. Recently, however, David Williamson has shown that unless
P = NP none of these problems can be approximated arbitrarily closely.

Theorem 1.1.2 [124] Unless P = NP,

•  There  is  no  polynomial  time  algorithm  that  approximates  the  job  shop  problem  or  the  flow 
shop  problem  within  a  factor  of  j|. 

•  There  is  no  polynomial  time  algorithm  that  approximates  the  open  shop  problem  within  a 
factor of |.

Despite the lack of knowledge about approximation algorithms with good worst-case relative
error  guarantees,  there  are  two  relevant  results  that  are  important  to  our  work.  The  most 
interesting  approximation  algorithms  to  date  for  job  shop  scheduling  have  primarily  appeared 
in  the  Soviet  literature  and  are  based  on  a  beautiful  connection  to  geometric  arguments.  This 
approach  was  independently  discovered  by  Belov  and  Stolin  [0]  and  Sevast’yanov  [108]  as  well  as 
by  Fiala  [39].  This  approach  typically  produces  schedules  for  which  the  length  can  be  bounded 
by  nmax  +  q(m,  fi)pmiX,  where  '?(•,•)  is  a  polynomial,  and  pmax  =  max,;-  p,-y  is  the  maximum 
operation  length.  For  the  job  shop  problem,  Sevast’yanov  [109,  110]  gave  a  polynomial-time 
algorithm  that  delivers  a  schedule  of  length  at  most  IImax  +  0(mp3)pmax.  The  bounds  obtained 
in  this  way  do  not  give  good  worst-case  relative  error  bounds.  Even  for  the  special  case  of  the 
flow  shop  problem,  the  best  known  algorithms  delivered  solutions  of  length  D(mC^ax). 

Since  these  results  are  not  well  known  in  the  West,  and  are  important  tools  for  us,  we 
provide  here  a  bit  of  information  about  the  proof  of  the  flow  shop  result,  which  is  simpler  than 
the  more  general  job  shop  result.  This  simpler  presentation  of  the  proof  is  due  to  David  Shmoys 
[114]. 

Theorem 1.1.3 There exists a polynomial time algorithm A for the flow shop problem which
yields a schedule of length bounded above by $C^*_{\max} + m(m-1)p_{\max}$.

Proof : 

The  proof  relies  heavily  on  the  following  lemma. 

Lemma 1.1.4 Let $\{v_1, v_2, \ldots, v_n\}$ be a set of d-dimensional vectors such that
$\sum_{j=1}^{n} v_j = 0$. There exists a polynomial-time algorithm that computes a permutation
$\pi$ such that for any $k = 1, \ldots, n$,
$\|\sum_{j=1}^{k} v_{\pi(j)}\| \le d \max_j \|v_j\|$, where we use $\|x\|$ to denote the
$L_\infty$-norm of x.

Without loss of generality, we can assume that the load on each machine is equal to the
maximum machine load, namely $\Pi_{\max}$. In this case the completion time of the schedule is
$\Pi_{\max} + I$, where I is the amount of idle time on the last machine before it starts
processing the last operation of the last job to complete on it. If we choose a permutation
$\pi$ of the n jobs and schedule their operations in that order on every machine, it is not hard
to see that the condition $\sum_{j=1}^{k} (p_{i,\pi(j)} - p_{i+1,\pi(j)}) \le (m-1)p_{\max}$,
for every k and every pair of consecutive machines, would yield an upper bound of
$m(m-1)p_{\max}$ on I.

Now if we construct a set of n $(m-1)$-dimensional vectors $v_j$, where
$v_j = (p_{1j} - p_{2j}, p_{2j} - p_{3j}, \ldots, p_{m-1,j} - p_{mj})$, then these vectors sum
to zero because all machine loads are equal, and the algorithm mentioned in the previous lemma
will produce the necessary permutation. ■
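
The construction in this proof can be illustrated with a small Python sketch. Everything named here is an assumption made for the example: the matrix `p` is a hypothetical instance with equal machine loads, and a brute-force search over permutations stands in for the polynomial-time algorithm of Lemma 1.1.4, which is not reproduced.

```python
from itertools import permutations


def difference_vectors(p):
    """p[i][j] is the time of job j on machine i; returns one (m-1)-vector per job."""
    m, n = len(p), len(p[0])
    return [[p[i][j] - p[i + 1][j] for i in range(m - 1)] for j in range(n)]


def max_prefix_norm(vectors, order):
    """Largest L-infinity norm of any prefix sum of the vectors taken in `order`."""
    prefix = [0] * len(vectors[0])
    worst = 0
    for j in order:
        prefix = [a + b for a, b in zip(prefix, vectors[j])]
        worst = max(worst, max(abs(x) for x in prefix))
    return worst


def permutation_makespan(p, order):
    """Length of the flow shop schedule that runs the jobs in `order` on every machine."""
    finish = [0] * len(p)          # finish[i]: time at which machine i becomes free
    for j in order:
        ready = 0                  # time at which job j left the previous machine
        for i in range(len(p)):
            finish[i] = max(finish[i], ready) + p[i][j]
            ready = finish[i]
    return finish[-1]


# Hypothetical instance with equal machine loads (each row sums to 6).
p = [[3, 1, 2],
     [1, 2, 3],
     [2, 3, 1]]
vectors = difference_vectors(p)
# Brute force stands in for the polynomial-time algorithm of the lemma.
best = min(permutations(range(3)), key=lambda order: max_prefix_norm(vectors, order))
```

On this instance the vectors sum to zero, and the schedule for `best` can be compared against the bound of Theorem 1.1.3 by calling `permutation_makespan(p, best)`.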

Another important result on shop scheduling comes, somewhat surprisingly, from the literature
on packet routing. Leighton, Maggs and Rao [82] have proposed the following model for the
routing  of  packets  in  a  network:  find  paths  for  the  packets  and  then  schedule  the  transmission 
of  the  packets  along  these  paths  so  that  no  two  packets  traverse  the  same  edge  simultaneously. 
The  primary  objective  is  to  minimize  the  time  by  which  all  packets  have  been  delivered  to  their 
destination. 

It is easy to see that the scheduling problem considered by Leighton, Maggs and Rao is
simply the job shop scheduling problem with each processing time $p_{ij} = 1$. They also added
the restriction that each path does not traverse any edge more than once, or in scheduling
terminology, each job has at most one operation on each machine. This restriction of the job
shop problem remains (strongly) $\mathcal{NP}$-hard [81]. The main result of Leighton, Maggs
and Rao was to show that for their special case of the job shop problem, there always exists a
schedule of length $O(\Pi_{\max} + P_{\max})$. Unfortunately, this is not an algorithmic result,
as it relies on a nonconstructive probabilistic argument based on the Lovász Local Lemma. They
also obtained a randomized algorithm that delivers a schedule of length
$O(\Pi_{\max} + P_{\max}\log n)$, with high probability.

Brief Statement of Our Results: We give a randomized
$O\left(\frac{\log^2(m\mu)}{\log\log(m\mu)}\right)$-approximation algorithm and a deterministic
$O(\log^2(m\mu))$-approximation algorithm for the job shop scheduling problem, and a
2-approximation algorithm for open shop scheduling. All of these significantly improve on the
previous best bounds of m. We also give two parallel algorithms: an $\mathcal{RNC}$
approximation algorithm for job shop scheduling and an $\mathcal{NC}$ $O(\log n)$-approximation
algorithm for open shop scheduling. Our results for job shop scheduling extend to a number of
important generalizations of the basic job shop problem.


1.2 The Parallel Machine Environment

In  contrast  to  the  shop  environment,  where  a  job  consists  of  multiple  operations,  each  of  which 
must  be  processed  by  a  particular  machine,  in  the  parallel  machine  environment  each  job  has 
only  one  operation,  which  may  be  processed  by  any  machine.  An  instance  consists  of  n  jobs 
and  m  machines.  Each  machine  can  process  at  most  one  job  at  a  time,  and  each  job  must 
be  processed  in  an  uninterrupted  fashion  on  one  of  the  machines.  Typically  one  of  three 
assumptions  is  made  about  the  relative  powers  of  the  machines.  In  the  most  general  setting, 
the machines are unrelated: job $J_j$ takes $p_{ij} = p_j/s_{ij}$ time units when processed by
machine $m_i$, where $p_j$ is the processing requirement (or size) of job $J_j$ and $s_{ij}$ is
the speed of machine $m_i$ on job $J_j$. If the machines are uniformly related, then each
machine $m_i$ runs at a given speed $s_i$ for all jobs $J_j$, and the processing time $p_{ij}$
is given by $p_j/s_i$. Finally, for identical machines, we assume that $s_i = 1$ for each
machine $m_i$. If $C_j$ denotes the time at which job $J_j$ completes processing in a schedule,
then the makespan or length of the schedule is $C_{\max} = \max_j C_j$. For a given instance I,
our objective is, once again, to find a schedule of minimum length $C^*_{\max}(I)$.
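
As a concrete illustration of the three machine models, the sketch below computes the makespan of a fixed non-preemptive assignment. The function name `makespan`, the callback `speed(i, j)`, and the numerical data are all assumptions introduced for this example.

```python
def makespan(assignment, sizes, speed):
    """Length of the schedule in which each machine runs its assigned jobs back to back.

    assignment[j]: index of the machine that processes job j.
    sizes[j]:      processing requirement p_j of job j.
    speed(i, j):   speed s_ij of machine i on job j.
    """
    busy = {}
    for j, i in enumerate(assignment):
        busy[i] = busy.get(i, 0.0) + sizes[j] / speed(i, j)
    return max(busy.values())


sizes = [4, 2, 6]
assignment = [0, 1, 1]      # job 0 on machine 0, jobs 1 and 2 on machine 1

# Identical machines: every speed is 1.
identical = makespan(assignment, sizes, lambda i, j: 1)
# Uniformly related machines: the speed depends only on the machine.
uniform = makespan(assignment, sizes, lambda i, j: [1, 2][i])
# Unrelated machines: the speed depends on both the machine and the job.
unrelated = makespan(assignment, sizes, lambda i, j: [[1, 1, 1], [2, 1, 3]][i][j])
```

Passing the speed as a function keeps one routine for all three models; the identical and uniformly related cases are simply special cases of the unrelated one.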

In an off-line setting, these three types of parallel machine models have been studied
extensively. The associated scheduling problems are all strongly $\mathcal{NP}$-hard [44], and
polynomial approximation schemes are known when the machines are either identical or uniformly
related [60, 61]. For unrelated machines, obtaining a solution better than $(3/2)C^*_{\max}$ is
$\mathcal{NP}$-hard, whereas a schedule of length at most $2C^*_{\max}$ can be found in
polynomial time [83]. We will be interested in on-line algorithms to solve these problems, that
is, algorithms that produce a good schedule without knowing the size of a job until it has
finished processing and without knowing what jobs will arrive in the future. We defer a
discussion of the motivation for and precise definitions of on-line algorithms until Chapter 3.

We will also consider the preemptive versions of these models, in which a job may be
interrupted on one machine and continued later (possibly on another machine) without penalty.
In each of these three models, there is a polynomial-time off-line algorithm to find an optimal
preemptive solution [89, 62, 79].

Brief Summary of Our Results: We introduce two rather general techniques for converting
scheduling algorithms that need more complete knowledge of the input data into ones that
need less advance knowledge. Using these techniques we give deterministic on-line scheduling
algorithms for all the parallel machine models we have described. For all except the unrelated
machines model we can prove that these algorithms are asymptotically optimal. We also
give  an  on-line  algorithm  for  the  non-preemptive  scheduling  of  uniformly  related  machines  that 
takes  advantage  of  the  structure  of  the  distribution  of  speeds  amongst  the  machines,  and  we 
prove  that  this  algorithm  is  asymptotically  optimal  as  well.  We  also  prove  a  surprisingly  strong 
lower  bound  on  the  performance  of  any  randomized  on-line  algorithm  for  the  non-preemptive 
scheduling  of  identical  machines. 


1.3  Scheduling  Problems  and  Parallel  Computation 

Efficiently  scheduling  a  set  of  tasks  on  a  set  of  machines  is  a  basic  and  important  problem  in 
scheduling  theory  and  in  combinatorial  optimization.  It  also,  in  some  ways,  captures  the  essence 
of  the  challenge  of  parallel  computation.  The  models  we  have  discussed  in  this  section  are  the 
traditional  models  of  combinatorial  scheduling  theory,  but  are  simplifications  of  useful  models 
for  real  parallel  machines.  We  therefore  present  here  a  small  sample  of  scheduling  models  that 
are  related  to  those  we  consider  but  capture  in  more  detail  the  problems  involved  in  designing 
parallel  algorithms  and  systems.  We  have  discussed  earlier  the  relationship  between  the  shop 
scheduling  model  and  packet-routing  problems,  so  in  this  section  we  focus  on  the  parallel 
machine  model. 

In a theoretical result, Papadimitriou and Yannakakis proposed a scheduling model as a
tool for architecture-independent analysis of parallel algorithms [93]. They assert that the
performance analysis of a parallel algorithm for a problem consists of at least four stages: (1)
Choosing the algorithm, which is described by a directed acyclic graph (dag) that captures the
elementary computations and their interdependence. The constraints imposed by the dag are
often called precedence constraints. (2) Choosing a particular multiprocessor architecture. (3)
Finding a schedule whereby the algorithm is executed on that architecture, so that the necessary
data are available to the appropriate processor at the time of each computation. (4) At this
point it is appropriate to discuss the performance of the algorithm, which is characterized by
the makespan of the schedule.

They introduce the parameter $\tau$, the communication delay between the time some information
is produced at a processor and the time that it can be used at another processor. Assuming
that there are enough processors to handle the width of the dag, they give an approximation
algorithm that for any $\tau$ and any dag comes within a factor of 2 of the optimal makespan.
Despite the assumption that disregards the number of processors, the technique does not
necessarily produce unrealistic processor-wasteful algorithms. In fact they are able to derive
asymptotic upper and lower bounds on the parallel complexity of several algorithms that are
processor-optimal for fixed time, and yield quite intuitive algorithms.

Berger  and  Cowen  considered  a  more  general  model  of  program  dependence  than  a  simple 
dag.  A  dag  captures  only  one  sort  of  dependency:  that  computation  i  must  be  performed  before 
computation j. Draper [30] gives examples where the ability to require concurrent scheduling of
two computations, or to require only that one computation is performed before or simultaneously
with another, would increase the ability to exploit the parallelism of a parallel system.
They  therefore  consider  the  problem  of  scheduling  unit-size  jobs  on  m  identical  machines  when 
the  execution  of  the  jobs  can  be  constrained  by  all  of  the  types  of  constraints  they  describe. 
Their  results  apply  to  a  problem  that  arises  in  practice  on  the  Horizon  architecture  [8]. 

Feitelson and Rudolph [36, 37] argue that a central theme of parallel processing is the
existence of many interacting threads of control that cooperate to perform a single computation.
The threads can be created throughout the computation and execute for different amounts of
time. Since threads may need to interact, there may be a need for synchronization; furthermore,
if the number of threads exceeds the number of processors, a naive approach may leave threads
sitting idle while they wait for other threads to be processed. A solution they suggest and are
implementing is gang scheduling, where one guarantees that a set of interacting threads executes
simultaneously. The model of gang scheduling is similar to our identical machine model, but
jobs now also have a width, which is the size of the block of processors on which the job must
be executed. They argue that the scheduling must be done in an on-line fashion, and study a
number of different rules under various distributions.


1.4 Notation Summary

Due  to  the  large  number  of  different  and  important  versions  of  scheduling  problems,  there  is  a 
fair  bit  of  notation  associated  with  the  field.  We  close  this  chapter  with  a  list  of  the  scheduling 
notation  used  in  this  thesis,  organized  in  such  a  fashion  as  to  summarize  the  similarities  and 
differences  between  the  different  models. 

List  of  Notation 


Common  to  both  environments: 

•  m:  the  number  of  machines. 

•  n:  the  number  of  jobs. 

•  $J_j$, $1 \le j \le n$: the jth job.

•  $m_i$, $1 \le i \le m$: the ith machine.

•  I: a problem instance.

•  A: an algorithm.

•  $C^*_{\max}(I)$: the length (makespan) of the shortest feasible schedule for instance I.

•  $C_{\max}(A)$: length (makespan) of the schedule produced by algorithm A.

Shop Scheduling:

•  $O_{ij}$: ith operation of job j.

•  $\mu$: the maximum number of operations in any job.

•  $K_{ij}$: the machine on which operation $O_{ij}$ must be processed.

•  $\Pi_{\max}$: the maximum machine load.

•  $p_{ij}$: the size of operation $O_{ij}$.

•  $p_{\max}$: the maximum operation size.

•  $P_{\max}$: the maximum total job size.

Parallel Machine Scheduling:

•  $p_j$: size of job j.

•  $s_i$: speed of machine i (Uniformly related machines).

•  $s_{ij}$: speed of machine i on job j (Unrelated machines).

•  $p_{\max}$: the maximum job size.

As  a  final  point  of  notation,  except  where  otherwise  specified  log  x  will  always  refer  to  the 
logarithm  base  2  of  x. 


Chapter  2 


Approximation  Algorithms  for  Shop 
Scheduling1 


2.1  Introduction 

As  we  have  discussed  in  Chapter  1,  despite  over  thirty  years  of  work  on  shop  scheduling,  very 
little has been known about approximation algorithms for these problems. In this chapter we
present  the  first  nontrivial  approximation  algorithms  for  these  scheduling  problems.  Our  most 
important  contribution  is  the  first  randomized  polynomial-time  polylogarithmic  approximation 
algorithm  for  job  shop  scheduling.  This  algorithm  builds  on  and  generalizes  the  framework 
established  by  Leighton,  Maggs  and  Rao  for  the  special  case  of  unit  operation  sizes  and  at  most 
one  operation  per  job  per  machine  [82]. 

Theorem 2.1.1 There exists a polynomial-time randomized algorithm for job shop scheduling
that, with high probability, yields a schedule that is of length
$O\left(\frac{\log^2(m\mu)}{\log\log(m\mu)} C^*_{\max}\right)$.

We also give a deterministic version of this algorithm with almost the same performance
guarantee.

Theorem 2.1.2 There exists a deterministic polynomial-time algorithm for job shop scheduling
which finds a schedule of length $O(\log^2(m\mu)\, C^*_{\max})$.


1 This chapter describes joint work with David Shmoys and Cliff Stein [112].

As a corollary, we also obtain a deterministic version of the randomized algorithm of
Leighton, Maggs and Rao. Our deterministic algorithm relies on results of Raghavan and
Thompson [102] and Raghavan [99] to approximate certain integer packing problems. Note
that if each job must be processed on each machine at most once, the factor of $\mu$ can be
deleted for this, and all other performance guarantees in this chapter.

Our  techniques  are  not  only  useful  for  the  job  shop  problem,  but  can  easily  be  extended  to 
the  general  problem  of  dag  shop  scheduling.  Another  important  generalization  is  the  situation, 
where,  rather  than  having  m  different  machines,  there  are  m'  types  of  machines,  and  for  each 
type,  there  are  a  specified  number  of  identical  machines;  each  operation,  rather  than  being 
assigned  to  one  machine,  may  be  processed  on  any  machine  of  the  appropriate  type.  These 
problems  have  significant  practical  importance,  since  in  real-world  shops  we  would  expect  that 
a  job  need  not  follow  a  total  order  and  that  the  shop  would  have  more  than  one  copy  of  many  of 
their  machines.  We  will  give  approximation  algorithms  with  the  same  performance  guarantees 
for  this  generalization  as  well. 

When m and $\mu$ are constants we can achieve much better approximation guarantees: we give
a $(2 + \epsilon)$-approximation algorithm for this special case. Finally, we give parallel approximation
algorithms  for  all  the  scheduling  models  mentioned  above,  and  some  improved  results  for  the 
open  shop  problem. 

While  all  of  the  algorithms  that  we  give  are  polynomial-time,  they  are  all  rather  inefficient. 
Most rely on the algorithms of Sevast'yanov; for example, his algorithm for job shop scheduling
takes $O((\mu m n)^2)$ time. Furthermore, the deterministic versions rely on linear programming. As
a  result,  we  will  not  refer  explicitly  to  running  times  throughout  the  remainder  of  this  chapter. 

The  rest  of  this  chapter  is  organized  as  follows.  In  Section  2.2  we  extend  the  basic  technique 
of  Leighton,  Maggs  and  Rao  to  the  general  job  shop  problem.  In  Section  2.3  we  show  how  to 
scale  and  reduce  the  input  data  so  that  the  techniques  of  Section  2.2  yield  good  performance 
bounds.  In  Section  2.4  we  show  how  our  techniques  apply  to  more  general  problems,  and  in 
Section 2.5 we show how to make them deterministic. We conclude with a discussion of the open
shop problem in Section 2.6 and some open problems in Section 2.7.


2.2 The Basic Algorithm

In  this  section  we  extend  the  technique  due  to  Leighton,  Maggs  and  Rao  [82]  of  assigning 
random  delays  to  jobs  to  the  general  case  of  non-preemptive  job  shop  scheduling.  A  valid 
schedule  assigns  at  most  one  job  to  a  particular  machine  at  any  time,  and  schedules  each  job  on 
at  most  one  machine  at  any  time.  Their  approach,  for  the  special  case  of  unit-size  operations 
and  at  most  one  operation  of  each  job  on  each  machine,  was  to  first  create  a  schedule  that 
obeyed  only  the  second  constraint,  and  then  build  from  this  a  schedule  that  satisfies  both 
constraints  and  is  not  much  longer.  The  outline  of  the  strategy  follows: 

1. Define the oblivious schedule, where each job starts running at time 0 and runs continuously
until all of its operations have been completed. This schedule is of length $P_{\max}$, but
there may be times when more than one job is assigned to a particular machine.

2. Perturb this schedule by delaying the start of the first operation of each job by a random
integral amount chosen uniformly in $[0, \Pi_{\max}/\log n]$. The resulting schedule, with high
probability, has no more than $O(\log n)$ operations assigned to any machine at any time.

3. Reschedule each unit of time t into $O(\log n)$ units of time during which each of the
$O(\log n)$ operations scheduled for time t is processed. The resulting (valid) schedule is of
length $O(P_{\max}\log n + \Pi_{\max})$.

Our  strategy  builds  upon  this  framework  of  Leighton,  Maggs  and  Rao.  Whereas  Step  2 
differs  in  only  a  few  technical  details,  the  essential  difficulty  in  obtaining  the  generalization  is 
in  Step  3. 

2. Perturb this schedule by delaying the start of the first operation of each job by a random
integral amount chosen uniformly in $[0, \Pi_{\max}]$. The resulting schedule, with high
probability, has no more than $O\left(\frac{\log(n\mu)}{\log\log(n\mu)}\right)$ jobs assigned to
any machine at any time.

3. “Spread” this schedule so that at each point in time all operations currently being
processed have the same size, and then “flatten” this into a schedule that has at most one
job per machine at any time.

For the analysis of Step 2, we assume that $p_{\max}$ is bounded above by a polynomial in n and
$\mu$; in the next section we will show how to remove this assumption. As is usually the case,
we assume that $n \ge m$; analogous bounds can be obtained when this is not true.

Lemma 2.2.1 Given a job shop instance in which $p_{\max}$ is bounded above by a polynomial in n
and $\mu$, the strategy of delaying each job an initial integral amount chosen randomly and
uniformly from $[0, \Pi_{\max}]$ and then processing its operations in sequence will yield an
(invalid) schedule that is of length at most $\Pi_{\max} + P_{\max}$ and, with high probability,
has no more than $O\left(\frac{\log(n\mu)}{\log\log(n\mu)}\right)$ jobs scheduled on any machine
during any unit of time.

Proof: Fix a time t and a machine $m_i$; consider p = Prob[at least r units of processing are
scheduled on machine $m_i$ at time t]. There are at most $\binom{\Pi_{\max}}{r}$ ways to choose
r units of processing from all those required on $m_i$. If we focus on a particular one of these
r units and a specific time t, then the probability that it is scheduled at time t is at most
$1/\Pi_{\max}$, since we selected a delay uniformly at random from among $\Pi_{\max}$
possibilities. If all r units are from different jobs then the probability that they are all
scheduled at time t is at most $(1/\Pi_{\max})^r$, since the delays are chosen independently.
Otherwise, the probability that all r are scheduled then is 0, since it is impossible. Therefore

$p \le \binom{\Pi_{\max}}{r} \left(\frac{1}{\Pi_{\max}}\right)^r \le \left(\frac{e\,\Pi_{\max}}{r}\right)^r \left(\frac{1}{\Pi_{\max}}\right)^r = \left(\frac{e}{r}\right)^r.$

If $r = k\frac{\log(n\mu)}{\log\log(n\mu)}$ then $p < (n\mu)^{-(k-1)}$. To bound the probability
that any machine at any time has more than $k\frac{\log(n\mu)}{\log\log(n\mu)}$ jobs using it,
multiply p by $P_{\max} + \Pi_{\max}$ for the number of time units in the schedule, and by m for
the number of machines. Since we have assumed that $p_{\max}$ is bounded by a polynomial in n
and $\mu$, $P_{\max} + \Pi_{\max}$ is as well; choosing k large enough yields that, with high
probability, no more than $k\frac{\log(n\mu)}{\log\log(n\mu)}$ jobs are scheduled for any machine
during any unit of time. ■
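
The random-delay step can be simulated directly. The Python sketch below is an illustration under stated assumptions: the instance encoding, the function name `delay_schedule`, and the fixed seed are choices made for this example, and the code only measures the congestion of one random draw rather than proving the high-probability bound.

```python
import random
from collections import Counter


def delay_schedule(jobs, m, seed=0):
    """Apply random initial delays and report (length, congestion, pi_max, big_p_max).

    jobs: list of jobs; each job is a list of (machine_index, integer_duration).
    The result is generally an invalid schedule: `congestion` is the largest
    number of jobs using one machine during one unit of time.
    """
    rng = random.Random(seed)
    loads = [0] * m
    for job in jobs:
        for machine, duration in job:
            loads[machine] += duration
    pi_max = max(loads)                                       # maximum machine load
    big_p_max = max(sum(d for _, d in job) for job in jobs)   # maximum total job size

    usage = Counter()                  # (machine, time unit) -> number of jobs
    length = 0
    for job in jobs:
        t = rng.randint(0, pi_max)     # integral delay drawn uniformly from [0, pi_max]
        for machine, duration in job:
            for unit in range(t, t + duration):
                usage[(machine, unit)] += 1
            t += duration
        length = max(length, t)
    return length, max(usage.values()), pi_max, big_p_max


jobs = [[(0, 3), (1, 2)], [(1, 4), (0, 1)], [(2, 2), (1, 1)]]
length, congestion, pi_max, big_p_max = delay_schedule(jobs, 3, seed=1)
```

Whatever delays are drawn, the length never exceeds `pi_max + big_p_max`, since each job starts by time `pi_max` and then runs without interruption.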

In the special case of unit-length operations treated by Leighton, Maggs and Rao, a schedule
S of length L that has at most c jobs scheduled on any machine at any unit of time can trivially
be “flattened” into a valid schedule of length cL by replacing one unit of S's time with c units
of time in which we run each of the jobs that was scheduled for that time unit. (See Figure
2.1.)

Figure 2.1: Flattening a schedule in the case with unit length operations.
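
For unit-length operations the flattening step is short enough to write out. The following Python sketch assumes a hypothetical representation in which `schedule[t][i]` lists the jobs placed on machine i during time unit t; the function name and the sample data are likewise assumptions.

```python
def flatten_unit_schedule(schedule, c):
    """Flatten a unit-operation schedule with at most c jobs per machine per time unit.

    schedule[t][i]: list of the jobs assigned to machine i during time unit t.
    Returns flat, where flat[t][i] is the single job on machine i at time t, or None.
    """
    flat = []
    for step in schedule:
        # Each original time unit becomes c consecutive units.
        for slot in range(c):
            flat.append([jobs[slot] if slot < len(jobs) else None for jobs in step])
    return flat


# Hypothetical schedule of length L = 2 on two machines with congestion c = 2.
schedule = [[["a", "b"], ["c"]],
            [["c"], ["a", "b"]]]
flat = flatten_unit_schedule(schedule, 2)
```

Because a job occupies at most one machine in each original time unit, it appears at most once among the c units that replace it, so the flattened schedule is valid and has length cL.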

For  preemptive  job  shop  scheduling,  where  the  processing  of  an  operation  may  be  interrupted, 
each  unit  of  an  operation  can  be  treated  as  a  unit-length  operation  and  a  schedule  that  has 
multiple  operations  scheduled  simultaneously  on  a  machine  can  easily  be  flattened  into  a  valid 
schedule.  This  is  not  possible  for  non-preemptive  job  shop  scheduling,  and  in  fact  it  seems  to 
be  more  difficult  to  flatten  the  schedule  in  this  case.  We  give  an  algorithm  that  takes  a  schedule 
of  length  L  with  at  most  c  operations  scheduled  on  one  machine  at  any  time  and  produces  a 
schedule of length $O(cL \log p_{\max})$.

Lemma 2.2.2 Given a schedule $S_0$ of length L that has at most c jobs scheduled on one machine
during any unit of time, there exists a polynomial-time algorithm that produces a valid schedule
of length $O(cL \log p_{\max})$.

Proof: To begin, we round up each processing time $p_{ij}$ to the next power of 2 and denote
the rounded times by $p'_{ij}$; that is, $p'_{ij} = 2^{\lceil \log_2 p_{ij} \rceil}$. Let
$p'_{\max} = \max_{ij} p'_{ij}$. From $S_0$, it is easy to obtain a schedule S that uses the
modified $p'_{ij}$ and is at most twice as long as $S_0$; furthermore, an optimal schedule for
the new problem is no more than twice as long as an optimal schedule for the original problem.

A block is an interval of a schedule with the property that each operation that begins during
this interval is of length no more than that of the entire interval. (Note that this does not mean
that the operation finishes within the interval.) We can divide S into
$\lceil L/p'_{\max} \rceil$ consecutive blocks of size $p'_{\max}$. We will give a recursive
algorithm that reschedules (“spreads”) each block of size p (where p is a power of 2) into a
sequence of schedule fragments of total length $O(p \log p)$; the operations scheduled in a
fragment of length T are all of length T, and start at the beginning of the fragment. This
algorithm takes advantage of the fact that if an operation of length p is scheduled to begin in
a block of size p, then that job is not scheduled on any other machine until after this block.
Therefore, that operation can be scheduled to start after all of the smaller operations in the
block finish.

To reschedule a block B of size $p'_{\max}$, we first construct the final fragment (which is of
length $p'_{\max}$), and then construct the preceding fragments by recursive calls of the
algorithm. For each operation of length $p'_{\max}$ that begins in B, reschedule that operation
to start at the beginning of the final fragment, and delete it from B. Now each operation that
still starts in B is of length at most $p'_{\max}/2$, so B can be subdivided into two blocks,
$B_1$ and $B_2$, each of size $p'_{\max}/2$, and we can recurse on each. See Figure 2.2.
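
The recursive spreading procedure just described can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `Op` record, the function names, and the sample block are assumptions, and empty fragments are emitted deliberately so that every machine shares the same fragment layout.

```python
from collections import namedtuple

# begin is measured in time units of the schedule S; length is a power of 2.
Op = namedtuple("Op", "job machine begin length")


def round_up_to_power_of_two(p):
    """Smallest power of 2 that is at least the positive integer p."""
    return 1 << (p - 1).bit_length()


def spread(ops, start, size):
    """Spread the block [start, start + size) into a list of (length, ops) fragments.

    Every operation in `ops` begins inside the block and has length at most `size`.
    """
    final = [op for op in ops if op.length == size]      # goes to the final fragment
    if size == 1:
        return [(1, final)]
    rest = [op for op in ops if op.length < size]        # each has length <= size / 2
    half = size // 2
    first = [op for op in rest if op.begin < start + half]
    second = [op for op in rest if op.begin >= start + half]
    return (spread(first, start, half)
            + spread(second, start + half, half)
            + [(size, final)])


# Hypothetical block of size 4.
block = [Op("a", 0, 0, 4), Op("b", 1, 0, 1), Op("c", 1, 1, 2), Op("d", 2, 3, 1)]
fragments = spread(block, 0, 4)
```

Each fragment contains only operations whose length equals the fragment length, and the operation of length 4 ends up in the last fragment, after all shorter operations of the block.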

The recurrence equation that describes the total length of the fragments produced from a
block of size T is $f(T) = 2f(T/2) + T$; $f(1) = 1$. Thus $f(T) = O(T\log T)$, and each block B
in S of size $p'_{\max}$ is spread into a schedule of length $O(p'_{\max}\log p'_{\max})$. By
spreading the schedule S, we
produce  a  new  schedule  S'  that  satisfies  the  following  conditions: 

1.  At  any  time  in  S',  all  operations  scheduled  are  of  the  same  length;  furthermore,  any  two 
operations  either  start  at  the  same  time  or  do  not  overlap. 

2.  If  S  has  at  most  c  jobs  scheduled  on  one  machine  at  any  time,  then  this  must  hold  for  S' 
as  well. 

3.  S'  schedules  a  job  on  at  most  one  machine  at  any  time. 

4. S' does not schedule the ith operation of job $J_j$ until the first i - 1 are completed.

Figure 2.2: (a) The initial greedy schedule of length 8, with $p'_{\max} = 4$. (b) The first
level of spreading. All jobs of length 4 have been put in the final fragments. We must now
recurse on $B_1$ and $B_2$ with $p'_{\max} = 2$. (c) The final schedule of length
$8 \log_2 8 = 24$.
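
For reference, the recurrence from the proof, $f(T) = 2f(T/2) + T$ with $f(1) = 1$, can be unrolled directly. The derivation below is a sketch added for convenience and assumes only that T is a power of 2.

```latex
\begin{align*}
f(T) &= 2f(T/2) + T \\
     &= 4f(T/4) + 2T \\
     &\;\;\vdots \\
     &= 2^k f(T/2^k) + kT \\
     &= T f(1) + T\log_2 T \qquad (k = \log_2 T) \\
     &= T\log_2 T + T = O(T\log T).
\end{align*}
```

In particular $f(4) = 12$, so the two blocks of size 4 in Figure 2.2 together occupy $2f(4) = 24$ time units, in agreement with the length quoted in the caption.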




Condition  1  is  satisfied  by  each  pair  of  operations  on  the  same  machine  by  the  definition 
of  spreading,  and  by  each  pair  of  operations  on  different  machines  because  the  division  of  time 
into  fragments  is  the  same  on  all  machines.  To  prove  condition  2,  note  that  operations  of  length 
T  that  are  scheduled  at  the  same  time  on  the  same  machine  in  the  expanded  schedule  started 
in  the  same  block  of  size  T  on  that  machine.  Since  they  all  must  have  been  scheduled  during 
the  last  unit  of  time  of  that  block,  there  can  be  at  most  c  of  them. 

To  prove  condition  3,  note  that  if  a  job  is  scheduled  by  S'  on  two  machines  simultaneously 
that  means  that  it  must  have  been  scheduled  by  S  to  start  two  operations  of  length  T  in  the 
same  block  of  length  T  on  two  different  machines.  This  means  it  was  scheduled  by  S  on  two 
machines  during  the  last  unit  of  time  of  that  block,  which  violates  the  properties  of  S. 

Finally we verify condition 4 by first noting that if two operations of a job are in different
blocks of size $p'_{\max}$ in S then they are certainly rescheduled in the correct order. Therefore
it  suffices  to  focus  on  the  schedule  produced  from  one  block.  Within  a  block,  if  an  operation 
is  rescheduled  to  the  final  fragment  then  it  is  the  last  operation  for  that  job  in  that  block. 
Therefore  S'  does  not  schedule  the  ith  operation  of  job  Jj  until  the  first  i  -  1  are  completed. 

The  schedule  S'  can  easily  be  flattened  to  a  schedule  that  obeys  the  constraint  of  one  job 
per  machine  at  any  time,  since  c  operations  of  length  T  that  start  at  the  same  time  can  just  be 
executed  one  after  the  other  in  total  time  cT.  Note  that  since  what  we  are  doing  is  effectively 
synchronizing  the  entire  schedule  block  by  block,  it  is  important  when  flattening  the  schedule  to 
make  each  machine  wait  enough  time  for  all  machines  to  process  all  operations  of  that  fragment 
length,  even  if  some  machines  have  no  operations  of  that  length  in  that  fragment. 

The schedule S' was of length $O(L \log p'_{\max})$; therefore the flattened schedule is of
length $O(cL \log p'_{\max})$. ■
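
The final flattening step can be sketched as follows. The fragment representation (a list of `(length, ops)` pairs with `(job, machine)` operations) and the name `flatten_fragments` are assumptions made for this illustration; the point is only that a fragment of length T is stretched to cT on every machine, whether or not that machine has work in it.

```python
def flatten_fragments(fragments, c):
    """Serialize a spread schedule with at most c same-machine operations per fragment.

    fragments: list of (length, ops); each op is a (job, machine) pair of that length.
    Returns (starts, total), where starts[(fragment_index, job, machine)] is a start time.
    """
    starts = {}
    clock = 0
    for index, (length, ops) in enumerate(fragments):
        placed = {}                         # machine -> operations placed so far
        for job, machine in ops:
            k = placed.get(machine, 0)
            if k >= c:
                raise ValueError("more than c operations on one machine in a fragment")
            starts[(index, job, machine)] = clock + k * length
            placed[machine] = k + 1
        clock += c * length                 # every machine waits for the full slot
    return starts, clock


# Hypothetical spread schedule with two fragments and congestion c = 2.
fragments = [(1, [("a", 0), ("b", 0)]), (2, [("a", 1)])]
starts, total = flatten_fragments(fragments, 2)
```

The total length is c times the sum of the fragment lengths, which is where the factor of c in the lemma comes from.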

We  note  in  passing  that  the  inclusion  of  release  dates  into  the  problem  will  not  affect  the 
quality of our bounds at all. The release dates can either be directly included into the
probabilistic analysis of Lemma 2.2.1, or we can view each release date as one additional initial operation on
some  (imaginary)  machine. 


2.3 Reducing the Problem

In the previous section we showed how to produce, with high probability, a schedule of length

$O\left((\Pi_{\max} + P_{\max}) \frac{\log(n\mu)}{\log\log(n\mu)} \log p_{\max}\right)$,

under the assumption that $p_{\max}$ was bounded above by a polynomial in n and $\mu$. Since

$\Pi_{\max} + P_{\max} = O(\max\{\Pi_{\max}, P_{\max}\})$

and both $\Pi_{\max}$ and $P_{\max}$ are lower bounds on $C^*_{\max}$, this schedule is within a
factor of $O\left(\frac{\log(n\mu)}{\log\log(n\mu)} \log p_{\max}\right)$ of optimality. In this
section, we will first remove the assumption that $p_{\max}$ is bounded above by a polynomial in
n and $\mu$ by showing that we can reduce the general problem to that special case while only
sacrificing a constant factor in the approximation. This yields an
$O\left(\frac{\log^2(n\mu)}{\log\log(n\mu)}\right)$-approximation algorithm. Then we will show
that n need not be polynomially bounded in m and $\mu$. Combining these two results, we conclude
that we can reduce the job shop problem to the case where n is polynomially bounded in m and
$\mu$, while changing the performance guarantee by only a constant.

2.3.1  Reducing  pmax 

First we will show that we can reduce the problem to one where $p_{\max}$ is bounded by a polynomial
in $n$ and $\mu$. Let $\omega = |O|$ be the total number of required operations. Note that $\omega \le n\mu$. Round
down each $p_{ij}$ to the nearest multiple of $p_{\max}/\omega$, denoted by $p'_{ij}$. Now there are at most $\omega$
distinct values of $p'_{ij}$ and they are all multiples of $p_{\max}/\omega$. Therefore we can treat the $p'_{ij}$ as
integers in $\{0, \ldots, \omega\}$; a schedule for this problem can be trivially rescaled to a schedule S' for
the actual $p'_{ij}$. (Note that assigning $p'_{ij} = 0$ does not mean that this operation does not exist;
instead, it should be viewed as an operation that takes an arbitrarily small amount of time.) Let
$L$ denote the length of S'. We claim that S' for this reduced problem can be interpreted as a
schedule for the original operations that will be of length at most $L + p_{\max}$. When we adjust the
$p'_{ij}$ up to the original $p_{ij}$, we add an amount that is at most $p_{\max}/\omega$ to each $p'_{ij}$. Since the length
of a schedule is determined by a critical path through the operations and there are $\omega$ operations,
we add a total amount of at most $p_{\max}$ to the length of the schedule; thus the new schedule is
of length at most $L + p_{\max} \le L + C^*_{\max}$. Therefore we have rounded a general instance $\mathcal{I}$ of
the job shop problem to an instance $\mathcal{I}'$ that can be treated as having $p_{\max} = O(n\mu)$; further, a
schedule for $\mathcal{I}'$ yields a schedule for $\mathcal{I}$ that is no more than $C^*_{\max}$ longer. Thus we have shown:

Lemma 2.3.1 There exists a polynomial-time algorithm which transforms any instance of the
job shop scheduling problem into one with $p_{\max} = O(n\mu)$ with the property that a schedule for
the modified instance of length $kC^*_{\max}$ can be converted in polynomial time to a schedule for the
original instance of length $(k + 1)C^*_{\max}$.
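
The rounding step can be illustrated with a short Python sketch. It is only an illustration under assumed names: the function `round_processing_times` and the list-of-lists input format are not part of the text. It maps each $p_{ij}$ to an integer in $\{0, \ldots, \omega\}$ and records how much each operation loses.

```python
from fractions import Fraction

def round_processing_times(p):
    """p[j] is the list of operation lengths of job j (positive integers).

    Returns (q, unit): q[j][i] is an integer in {0, ..., omega}, and
    q[j][i] * unit is p[j][i] rounded down to a multiple of
    unit = p_max / omega. (Input format is an assumption of this sketch.)
    """
    omega = sum(len(ops) for ops in p)      # total number of operations
    p_max = max(max(ops) for ops in p)      # longest operation
    unit = Fraction(p_max, omega)
    # floor(x / unit) computed with exact integer arithmetic
    q = [[(x * omega) // p_max for x in ops] for ops in p]
    return q, unit

p = [[7, 100, 3], [45, 1], [99]]
q, unit = round_processing_times(p)
# Amount lost by each operation when it is rounded down.
loss = [x - y * unit for ops, qs in zip(p, q) for x, y in zip(ops, qs)]
```

Each entry of `loss` is smaller than `unit`, so any chain of at most $\omega$ operations, in particular a critical path, grows by less than $p_{\max}$ when the original lengths are restored.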

2.3.2  Reducing  the  Number  of  Jobs 

To reduce an arbitrary instance of job shop scheduling to one with a number of jobs polynomial
in $m$ and $\mu$ we divide the jobs into big and small jobs. We say that job $J_j$ is big if it has an
operation of length more than $\Pi_{\max}/(2m\mu^3)$; otherwise we call the job small. For the instance
consisting of just the small jobs, let $\Pi'_{\max}$ and $p'_{\max}$ denote the maximum machine load and
operation length, respectively. Using the algorithm of [110] described in the introduction, we
can, in time polynomial in the input size, produce a schedule of length $\Pi'_{\max} + 2m\mu^3 p'_{\max}$ for
this instance. Since $p'_{\max}$ is at most $\Pi_{\max}/(2m\mu^3)$ and $\Pi'_{\max} \le \Pi_{\max}$, we get a schedule that is
of length no more than $2\Pi_{\max}$. Thus, an algorithm that produces a schedule for the big jobs
that is within a factor of $k$ of optimal will yield a $(k + 2)$-approximation algorithm. Note that
there can be at most $2m^2\mu^3$ big jobs, since otherwise there would be more than $m\Pi_{\max}$ units of
processing to be divided amongst $m$ machines, which contradicts the definition of $\Pi_{\max}$. Thus
we have shown:

Lemma 2.3.2 There exists a polynomial-time algorithm which transforms any instance of the
job shop scheduling problem into one with $O(m^2\mu^3)$ jobs with the property that a schedule for the
modified instance of length $kC^*_{\max}$ can be converted in polynomial time to a schedule for the original
instance of length $(k + 2)C^*_{\max}$.
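
As a rough illustration of the partition used above, the following Python sketch splits jobs by the threshold $\Pi_{\max}/(2m\mu^3)$. The function name `split_jobs` and the representation of a job as a list of `(machine, length)` pairs are assumptions made for this example only.

```python
def split_jobs(jobs, m):
    """jobs[j] is a list of (machine, length) pairs, machines numbered 0..m-1.

    Returns (big, small): lists of job indices, where a job is big if it
    has an operation longer than Pi_max / (2 * m * mu**3).
    (Hypothetical helper; the data layout is an assumption.)
    """
    load = [0] * m
    for ops in jobs:
        for machine, length in ops:
            load[machine] += length
    pi_max = max(load)                      # maximum machine load
    mu = max(len(ops) for ops in jobs)      # most operations in any job
    bound = 2 * m * mu ** 3
    big, small = [], []
    for j, ops in enumerate(jobs):
        # length > pi_max / bound, tested without division
        if any(length * bound > pi_max for _, length in ops):
            big.append(j)
        else:
            small.append(j)
    return big, small

m = 2
jobs = [[(0, 1000)], [(1, 900)]] + [[(0, 1)]] * 50 + [[(1, 2)]] * 50
big, small = split_jobs(jobs, m)
```

In this toy instance $\mu = 1$, so the threshold is $\Pi_{\max}/4$ and only the two long jobs are big, well within the $2m^2\mu^3 = 8$ allowed by the counting argument.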

From  the  results  of  the  previous  two  sections,  we  can  conclude  that: 

Theorem 2.3.3 There exists a polynomial-time randomized algorithm for job shop scheduling,
that, with high probability, yields a schedule that is of length at most
$O\left(\frac{\log^2(m\mu)}{\log\log(m\mu)} C^*_{\max}\right)$.

Proof: In Section 2.2 we showed how to produce a schedule of length

$$O\left((\Pi_{\max} + P_{\max}) \frac{\log(n\mu)}{\log\log(n\mu)} \log p_{\max}\right)$$

under the assumption that $p_{\max}$ was bounded above by a polynomial in $n$ and $\mu$. From Lemmas
2.3.1 and 2.3.2 we know that we can reduce the problem to one where $n$ and $p_{\max}$ are polynomial
in $m$ and $\mu$, while adding only a constant to the factor of approximation. Since now
$\log p_{\max} = O(\log(m\mu))$ and $\log n = O(\log(m\mu))$, our algorithm produces a schedule of length
$O\left(\frac{\log^2(m\mu)}{\log\log(m\mu)} C^*_{\max}\right)$. ■

Note that when $\mu$ is bounded by a polynomial in $m$ the bound only depends on $m$. In
particular, this implies the following corollary:

Corollary 2.3.4 There exists a polynomial-time randomized algorithm for flow shop scheduling,
that, with high probability, yields a schedule that is of length at most
$O\left(\frac{\log^2 m}{\log\log m} C^*_{\max}\right)$.

Except for the use of Sevast'yanov's algorithm, all of these techniques can be carried out
in RNC (footnote 2). We assign one processor to each operation. The rounding in the proof of Lemma
2.2.2 can be done in NC. We set the random delays and inform each processor about the
delay of its job. By summing the values of $p_{ij}$ for all of its job's operations, each processor can
calculate where its operation is scheduled with the delays and then where it is scheduled in the
recursively spread out schedule. These sums can be calculated via parallel prefix operations.
With simple NC techniques we can assign to each operation a rank among all those operations
that are scheduled to start at the same time on its machine, and thus flatten the spread out
schedule to a valid schedule.

Corollary 2.3.5 There exists an RNC algorithm for job shop scheduling, that, with high probability,
yields a schedule that is of length at most $O\left(\frac{\log^2(n\mu)}{\log\log(n\mu)} C^*_{\max}\right)$.

2 Since most of our discussion of parallel algorithms for combinatorial problems occurs in the latter chapters
of this thesis, we have put off a discussion of NC, RNC and models of parallel algorithms until Chapter 4. We
refer the reader who is not familiar with these terms to that discussion.

2.3.3 A Fixed Number of Machines

It is interesting to note that Sevast'yanov's algorithm for the job shop problem can be viewed
as a $(1 + m\mu^3)$-approximation algorithm, so that when $m$ and $\mu$ are constant, this is an $O(1)$-
approximation algorithm; that is, it delivers a solution within a constant factor of the optimum.
The technique of partitioning the set of jobs by size can be applied to give a much better
performance guarantee in this case. Now call a job $J_j$ big if there is an operation $O_{ij}$ with
$p_{ij} > \epsilon\Pi_{\max}/(m\mu^3)$, where $\epsilon$ is an arbitrary positive constant. Note that there are at most
$m^2\mu^3/\epsilon$ big jobs, and since $m$, $\mu$ and $\epsilon$ are fixed, this is a constant.

Now use Sevast'yanov's algorithm to schedule all of the small jobs. The resulting schedule
will be of length at most $(1 + \epsilon)C^*_{\max}$. There are only a constant (albeit a huge constant) number
of ways to schedule the big jobs. Therefore the best one can be selected in polynomial time
and executed after the schedule of the small jobs. The additional length of this part is no more
than $C^*_{\max}$.

Thus we have shown:

Theorem 2.3.6 For the job shop scheduling problem where both $m$ and $\mu$ are fixed, there is a
polynomial-time algorithm that produces a schedule of length $\le (2 + \epsilon)C^*_{\max}$.

2.4  Applications  to  More  General  Scheduling  Problems 

The fact that the quality of our approximations is based solely on the lower bounds $\Pi_{\max}$ and
$P_{\max}$ makes it quite easy to extend our techniques to the more general problem of dag shop
scheduling. We define $\Pi_{\max}$ and $P_{\max}$ exactly the same way, and $\max\{\Pi_{\max}, P_{\max}\}$ remains a
lower bound for the length of any schedule. We can convert this dag shop scheduling problem
to a job shop problem by selecting for each job an arbitrary total order that is consistent with
its partial order. $\Pi_{\max}$ and $P_{\max}$ have the same values for both problems. Therefore, a schedule
of length $\rho \cdot (\Pi_{\max} + P_{\max})$ for this job shop instance is a schedule for the original dag shop
scheduling instance of length $O(\rho C^*_{\max})$.
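
Choosing a total order consistent with a job's partial order is a topological sort. The Python sketch below shows one way to do this with the standard-library `graphlib` module; the function name `linearize_job` and the toy operation names are illustrative assumptions, not notation from the text.

```python
from graphlib import TopologicalSorter

def linearize_job(operations, precedes):
    """operations: iterable of operation ids belonging to one job.
    precedes: pairs (a, b) meaning a must finish before b starts.

    Returns one total order consistent with the partial order.
    (Hypothetical helper for illustration.)
    """
    preds = {op: set() for op in operations}
    for a, b in precedes:
        preds[b].add(a)                 # map each node to its predecessors
    return list(TopologicalSorter(preds).static_order())

order = linearize_job(
    ["cut", "drill", "paint", "pack"],
    [("cut", "drill"), ("cut", "paint"), ("drill", "pack"), ("paint", "pack")],
)
```

Any order returned this way respects every precedence constraint, and since the set of operations is unchanged, the machine loads and job lengths are unaffected.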

A further generalization to which our techniques apply is where, rather than $m$ different
machines, we have $m'$ types of machines, and for each type we have a specified number of
identical machines of that type. Instead of requiring an operation to run on a particular
machine, an operation now has to run on only one of these identical copies. $P_{\max}$ remains a
lower bound on the length of any schedule for this problem. $\Pi_{\max}$, which was a lower bound for
the job shop problem, must be replaced, since we do not have a specific assignment of operations
to machines, and the sum of the processing times of all operations assigned to a type is not a
lower bound. Let $S_i$, $i = 1, \ldots, m'$, denote the sets of identical machines, and let $\Pi(S_i)$ be the
sum of the lengths of the operations which run on $S_i$. Our strategy is to convert this to a job
shop problem by assigning operations to specific machines in such a way that the maximum
machine load is within a constant factor of the fundamental lower bounds for this problem. To
obtain a lower bound on the maximum machine load, note that the best we could do would be
to evenly distribute the operations across machines in a set, thus

$$\Pi_{avg} = \max_{S_i} \frac{\Pi(S_i)}{|S_i|}$$

is certainly a lower bound on the maximum machine load. Furthermore, we cannot split
operations, so $p_{\max}$ is also a lower bound. We will now describe how to assign operations to
machines so that the maximum machine load of the resulting job shop scheduling problem is
at most $2\Pi_{avg} + p_{\max}$. A schedule for the resulting job shop problem of length $\rho \cdot (\Pi_{\max} + P_{\max})$
yields a solution for the more general problem of length $O(\rho(\Pi_{avg} + P_{\max}))$. Sevast'yanov [110]
used a somewhat more complicated reduction to handle a slightly more general setting.

For each operation $O_{ij}$ to be processed by a machine in $S_k$, if $p_{ij} > \Pi(S_k)/|S_k|$, assign $O_{ij}$ to
one machine in $S_k$. There are certainly enough machines in $S_k$ to give each such operation a
machine of its own, and this contributes at most $p_{\max}$ to the maximum machine load. Those
operations not yet assigned are each of length at most $\Pi(S_k)/|S_k|$ and have total length
$\le \Pi(S_k)$. Therefore, these can be assigned easily to the remaining machines so that less than
$2\Pi(S_k)/|S_k|$ processing units are assigned to each machine. Combining these two bounds, we get
an upper bound on the maximum machine load of $2\Pi_{avg} + p_{\max}$, which is within a constant factor
of the lower bound of $\max\{\Pi_{avg}, p_{\max}\}$.
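
The text does not say how the short operations are "assigned easily"; one simple possibility, sketched below in Python, is to place each short operation on the currently least-loaded remaining machine. The function name `assign_to_copies`, the input format, and the least-loaded rule itself are assumptions of this sketch rather than the method of the text.

```python
import heapq

def assign_to_copies(lengths, copies):
    """lengths: operation lengths for one machine type; copies: |S_k| >= 1.

    Returns a list of per-machine lists of operation indices.
    (Illustrative helper; least-loaded placement is an assumed rule.)
    """
    total = sum(lengths)
    # "long" means length > total / copies, tested without division
    long_ops = [i for i, p in enumerate(lengths) if p * copies > total]
    short_ops = [i for i, p in enumerate(lengths) if p * copies <= total]
    machines = [[i] for i in long_ops]          # one private machine each
    rest = [[] for _ in range(copies - len(long_ops))]
    heap = [(0, k) for k in range(len(rest))]   # (current load, machine)
    for i in short_ops:
        load, k = heapq.heappop(heap)           # least-loaded machine
        rest[k].append(i)
        heapq.heappush(heap, (load + lengths[i], k))
    return machines + rest

lengths = [10, 1, 1, 2, 3, 3]
result = assign_to_copies(lengths, 3)
loads = [sum(lengths[i] for i in machine) for machine in result]
```

Under this rule the least-loaded machine always carries less than $\Pi(S_k)/|S_k|$ before a short operation is added, so every machine holding short operations ends with load below $2\Pi(S_k)/|S_k|$.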


Theorem 2.4.1 There exists a polynomial-time randomized algorithm for dag shop scheduling
with identical copies of machines that, with high probability, yields a schedule that is of length at
most $O\left(\frac{\log^2(m\mu)}{\log\log(m\mu)} C^*_{\max}\right)$.

Corollary 2.4.2 There exists an RNC algorithm for dag shop scheduling with identical copies of
machines that, with high probability, yields a schedule that is of length at most
$O\left(\frac{\log^2(n\mu)}{\log\log(n\mu)} C^*_{\max}\right)$.

2.5  A  Deterministic  Approximation  Algorithm 

In this section, we "derandomize" the results of the previous sections, i.e., we give a deterministic
polynomial-time algorithm that finds a schedule of length $O(\log^2(m\mu) C^*_{\max})$. Of all the
components of the algorithm of Theorem 2.3.3, the only step which is not already deterministic
is the step that chooses a random initial delay for each job and then proves that, with high
probability, no machine is assigned too many jobs at any one time. In particular, the reduction
to the special case in which $n$ and $p_{\max}$ are bounded by a polynomial in $m$ and $\mu$ is entirely
deterministic, and so we can focus on that case alone. We will give an algorithm which
deterministically assigns delays to each job so as to produce a schedule in which each machine has
$O(\log(m\mu))$ jobs running at any one time. We then apply Lemma 2.2.2 to produce a schedule
of length $O(\log^2(m\mu) C^*_{\max})$. Note that the $O(\log(m\mu))$ jobs per machine is not as good as
the probabilistic bound of $O\left(\frac{\log(m\mu)}{\log\log(m\mu)}\right)$; we do not know how to achieve this deterministically.

However, by a proof nearly identical to that of Lemma 2.2.1, we can show that in order to
achieve this weaker bound on the number of jobs per machine, we now only need to choose
delays in the range $[0, \Pi_{\max}/\log(m\mu)]$. In fact, the reduced range of delays yields a schedule
of length $O(P_{\max}\log^2(m\mu) + \Pi_{\max}\log(m\mu))$, which is within an $O(\log(m\mu))$ factor of optimal if
$P_{\max} = o(\Pi_{\max}/\log(m\mu))$.

Our  approach  to  solving  this  problem  is  to  frame  it  as  a  vector  selection  problem  and  then 
apply  techniques  developed  by  Raghavan  and  Thompson  [101,  102]  and  Raghavan  [99]  which 
find  constant  factor  approximations  to  certain  “packing”  integer  programs.  The  approach  is  to 
formulate  the  problem  as  a  {0,  l}-integer  program,  solve  the  linear  programming  relaxation, 
and  then  randomly  round  the  solution  to  an  integer  solution. 

For  certain  types  of  problems  this  yields  provably  good  approximations  with  high  probability 
[101,  102].  Furthermore,  for  many  of  the  problems  for  which  there  are  approximations  with 
high  probability,  the  algorithm  can  be  derandomized.  Raghavan  [99]  has  shown  how  to  do  this 
by  essentially  setting  the  random  bits  one  at  a  time. 
We now state the problem formally:

Problem 2.5.1 Deterministically assign a delay to each job in the range $[0, \Pi_{\max}/\log(m\mu)]$ so
as to produce a schedule with no more than $O(\log(m\mu))$ jobs on any machine at any time.

Lemma  2.5.2  Problem  2.5.1  can  be  solved  in  deterministic  polynomial  time. 

Proof: Since we introduce delays in the range $[0, \Pi_{\max}/\log(m\mu)]$, the resulting schedule has
length $\ell = P_{\max} + \Pi_{\max}/\log(m\mu)$. We can represent the processing of a job $J_j$ with a given initial
delay $d$ by an $(\ell \cdot m)$-length $\{0,1\}$-vector where each position corresponds to a machine at a
particular time. The position corresponding to machine $m_i$ and time $t$ is 1 if $m_i$ is processing
job $J_j$ at time $t$, and 0 otherwise. For each job $J_j$ and each possible delay $d$, there is a vector
$V_{j,d}$ which corresponds to assigning delay $d$ to $J_j$.

Let $\Lambda_j$ be the set of vectors $\{V_{j,1}, \ldots, V_{j,d_{\max}}\}$, where $d_{\max} = \Pi_{\max}/\log(m\mu)$, and let $V_{j,k}(i)$
be the $i$th component of $V_{j,k}$. Given the set $\Lambda = \{\Lambda_1, \ldots, \Lambda_n\}$ of sets of vectors, our problem
can be stated as the problem of choosing one vector from each $\Lambda_j$ (denoted $V_j^*$), such that

$$\left\| \sum_{j=1}^{n} V_j^* \right\|_\infty = O(\log(m\mu)),$$

i.e., at any time on any machine, the number of jobs using that machine is $O(\log(m\mu))$.
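
To make the vector formulation concrete, the following Python sketch builds the support of each $\{0,1\}$-vector and evaluates the infinity norm of a chosen sum. The names `occupancy` and `congestion`, and the representation of a job as a list of `(machine, length)` pairs run back to back after its delay, are assumptions of this example.

```python
from collections import Counter

def occupancy(ops, delay):
    """ops: list of (machine, length) for one job, in processing order.

    Returns the set of (machine, time) slots the job occupies when it
    starts at time `delay` and runs its operations back to back; this set
    is the support of the {0,1}-vector for that job and delay.
    """
    slots, t = set(), delay
    for machine, length in ops:
        slots.update((machine, t + s) for s in range(length))
        t += length
    return slots

def congestion(jobs, delays):
    """Largest number of jobs on one machine in one time unit, i.e. the
    infinity norm of the sum of the chosen vectors."""
    count = Counter()
    for ops, d in zip(jobs, delays):
        count.update(occupancy(ops, d))
    return max(count.values(), default=0)

jobs = [[(0, 2), (1, 1)], [(0, 1), (1, 2)], [(0, 3)]]
no_delay = congestion(jobs, [0, 0, 0])
spread = congestion(jobs, [0, 2, 3])
```

With all delays zero the three toy jobs collide on machine 0 at time 0, whereas the delays `[0, 2, 3]` leave at most one job per machine in every time unit.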

As in [99], we can reformulate this as a $\{0,1\}$-integer program. Let $x_{j,k}$ be the indicator
variable used to indicate whether $V_{j,k}$ is selected from $\Lambda_j$. Consider the integer program (IP)
that assigns $\{0,1\}$ values to the variables $x_{j,k}$ to minimize $W$ subject to the constraints:

$$\sum_{k=1}^{d_{\max}} x_{j,k} = 1, \quad j = 1, \ldots, n,$$

$$\sum_{j=1}^{n} \sum_{k=1}^{d_{\max}} x_{j,k} V_{j,k}(i) \le W, \quad i = 1, \ldots, \ell \cdot m.$$

Let $W_{opt}$ be the optimum value of $W$, which is the maximum number of jobs that ever
use a machine at any time. We already know, by Lemma 2.2.1, that $W_{opt} = O(\log(m\mu))$, so
if we could solve this integer program optimally we would be done. However, the problem is
NP-hard. Instead, we rely on the following theorem which is immediate from the results in
[99] and [102].

Theorem 2.5.3 [99, 102] A feasible solution to (IP) with $W = O(W_{opt} + \log(m\mu))$ can be
found in polynomial time.


We  then  apply  Lemma  2.2.2  and  obtain  the  following  result: 

Theorem 2.5.4 There exists a deterministic polynomial-time algorithm which finds a schedule
of length $O(\log^2(m\mu) C^*_{\max})$.

2.6  The  Open  Shop  Problem 

Recall that in the open shop problem the operations of a job can be executed in any order.
Fiala [40] has also shown that if $\Pi_{\max} \ge (16m\log m + 21m)p_{\max}$, then $C^*_{\max}$ is just $\Pi_{\max}$, and
there is a polynomial-time algorithm to find an optimal schedule, but in general this problem
is strongly NP-complete. We will show that, in contrast to the job and flow shop problems, a
simple greedy strategy yields a fairly good approximation to the optimal open shop schedule.
Consider the algorithm that, whenever a machine is idle, assigns to it any job that has not
yet been processed on that machine and is not currently being processed on another machine.
Anna Racsmany [5] has observed that the greedy algorithm delivers a schedule of length at
most $\Pi_{\max} + (m - 1)p_{\max}$. We can adapt her proof to show that, in fact, the greedy algorithm
delivers a schedule that is no longer than a factor of two times optimal. We can also show that
this is essentially a tight bound.

Theorem 2.6.1 The greedy algorithm for the open shop problem is a 2-approximation algorithm.
This is true even when each job $J_j$ has an associated release date $r_j$ on which it becomes available
for processing. Furthermore, this strategy can produce schedules that are as long as $(2 - \frac{1}{m})$ times
optimal.

Proof: Consider the machine $m_k$ that finishes last in the greedy schedule; this machine is
active sometimes, idle sometimes, and finishes by completing some job $J_j$. Since the schedule
is greedy, whenever $m_k$ is idle, $J_j$ is either being processed by some other machine or has not
yet been released. Therefore, the idle time is at most $\sum_{i \ne k} p_{ij} + r_j < P_j + r_j$. Thus, machine
$m_k$ is processing for at most $\Pi_{\max}$ units of time and is idle for less than $P_j + r_j$ units of time;
hence $C_{\max} < \Pi_{\max} + P_j + r_j$. However, $P_j + r_j$ is a lower bound on the length of the schedule,
since no processing of job $J_j$ could start until time $r_j$. Since $\Pi_{\max}$ is also a lower bound, the
schedule length is within a factor of 2 of optimal.

To lower bound the worst-case performance of this algorithm, consider an open shop instance
with $n = 2m - 1$ jobs. Job $J_1$ has one operation of size 1 on each machine, job $J_2$ has one
operation of size $m - 1$ on machine 1 and job $J_3$ has one operation of size $m - 1$ on machine
$m$. Finally, for each machine $i$, $2 \le i \le m - 1$, there are two jobs, each with one operation on
machine $i$. One job has an operation of size $i - 1$ and the other job has an operation of size
$m - i$. The optimal schedule is of length $m$ but a greedy algorithm can produce a schedule of
length $2m - 1$ (see Figure 2.3). ■
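
The lower-bound instance can be checked with a small simulation. The Python sketch below is an illustration under assumptions: machines are numbered from 0, all operation sizes are positive integers, time advances in unit steps, and the `priority` list (not part of the text) decides which available job an idle machine picks.

```python
def greedy_open_shop(jobs, m, priority):
    """jobs[j] maps machine -> integer operation size; priority is the
    tie-breaking order used when several jobs are available.
    Returns the makespan of the resulting greedy schedule."""
    remaining = {j: dict(ops) for j, ops in jobs.items()}
    machine_free = [0] * m              # time each machine becomes idle
    job_free = {j: 0 for j in jobs}     # time each job becomes available
    t = makespan = 0
    while any(remaining.values()):
        for i in range(m):
            if machine_free[i] > t:
                continue                # machine i is busy at time t
            for j in priority:
                if i in remaining[j] and job_free[j] <= t:
                    size = remaining[j].pop(i)
                    machine_free[i] = job_free[j] = t + size
                    makespan = max(makespan, t + size)
                    break
        t += 1
    return makespan

def lower_bound_instance(m):
    """The instance from the proof, with machines renumbered 0..m-1."""
    jobs = {"J1": {i: 1 for i in range(m)},
            "J2": {0: m - 1},
            "J3": {m - 1: m - 1}}
    for i in range(2, m):               # machines 2..m-1 in the text
        jobs[f"A{i}"] = {i - 1: i - 1}
        jobs[f"B{i}"] = {i - 1: m - i}
    return jobs

m = 5
jobs = lower_bound_instance(m)
others = [j for j in jobs if j != "J1"]
bad = greedy_open_shop(jobs, m, others + ["J1"])    # J1 picked last
good = greedy_open_shop(jobs, m, ["J1"] + others)   # J1 picked first
```

Giving $J_1$ the lowest priority keeps every machine busy with other jobs until time $m - 1$, after which the $m$ unit operations of $J_1$ must run one after another; giving it the highest priority recovers a schedule of length $m$.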

Using a slightly different (non-greedy) strategy, we can derive another algorithm which
achieves a schedule of length $O(\log n \cdot C^*_{\max})$. This algorithm will also be easily parallelizable,
thus putting the problem of finding an $O(\log n)$-approximation to the open shop scheduling
problem in NC.

We define the jobs graph, which is a bipartite graph that represents an instance of the open
shop problem. One side of the bipartition contains $m$ nodes, one for each machine, whereas the
other side contains $n$ nodes, one for each job. If job $J_j$ has an operation on machine $i$ then the
jobs graph contains an edge between the respective nodes.

First consider the case when all the operations are of the same size, say $\ell$. Let $\Delta$ be the
maximum degree of any node in the jobs graph. Then $\ell\Delta$ is a lower bound on the
length of the optimal schedule for this problem. However, since this is a bipartite graph with
maximum degree $\Delta$, it can be edge-colored using exactly $\Delta$ colors. So we edge-color the graph,
and then schedule the operations in each color class separately. This produces a schedule of
length $\ell\Delta$, which is optimal. As long as there is at least one processor per operation, this can
be done in NC using the edge-coloring algorithm of Lev, Pippenger, and Valiant [84].
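
For intuition, a bipartite graph can be edge-colored with $\Delta$ colors sequentially by the classical alternating-path argument. The Python sketch below implements that sequential method, not the parallel algorithm of [84]; the function name `edge_color_bipartite` and the edge-list input are assumptions of this example.

```python
def edge_color_bipartite(edges):
    """edges: list of distinct (job, machine) pairs.

    Returns a dict edge -> colour in range(max_degree) such that edges
    sharing a job or a machine receive different colours.
    """
    col, degree = {}, {}                # col[node][colour] = neighbour
    for j, i in edges:
        for node in (("J", j), ("M", i)):
            degree[node] = degree.get(node, 0) + 1
            col.setdefault(node, {})
    delta = max(degree.values(), default=0)

    def free(node):
        return next(c for c in range(delta) if c not in col[node])

    for j, i in edges:
        u, v = ("J", j), ("M", i)
        a, b = free(u), free(v)
        if a in col[v]:
            # Swap colours a and b along the alternating path leaving v by
            # its a-coloured edge; afterwards a is free at both u and v.
            path, x, c = [], v, a
            while c in col[x]:
                y = col[x][c]
                path.append((x, y, c))
                x, c = y, (b if c == a else a)
            for x, y, c in path:
                del col[x][c], col[y][c]
            for x, y, c in path:
                d = b if c == a else a
                col[x][d], col[y][d] = y, x
        col[u][a], col[v][a] = v, u
    return {(u[1], v[1]): c
            for u in col if u[0] == "J" for c, v in col[u].items()}

edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1), (2, 2), (0, 2)]
colouring = edge_color_bipartite(edges)
```

With operations of common size $\ell$, the operations in color class $c$ can then be run in the time interval from $c\ell$ to $(c + 1)\ell$, since no two of them share a job or a machine.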

Figure 2.3: Instance for lower bound on performance of greedy open shop algorithm. Black
squares represent the operations of job $J_1$. Schedule (a) is the optimum schedule, but schedule
(b) is a greedy schedule in which $J_1$ is not started until after time $m - 1$.

We can extend this to solve the open shop problem by first using the techniques of Section
2.3.1 to reduce the problem to the case where all the operations have sizes polynomial in $n$, and
then by rounding the operation sizes so they are all powers of 2. Now there are only $O(\log n)$
different operation sizes. We schedule each one separately, using the edge-coloring strategy
above. The schedule we get for any particular size $\ell$ is optimal for the operations of that size;
hence each of the $O(\log n)$ schedules we produce is of length $O(C^*_{\max})$. Concatenating these
schedules together, and observing that all the rounding can easily be done in NC, we obtain
the following theorem:

Theorem 2.6.2 An open shop schedule of length $O(\log n \cdot C^*_{\max})$ can be found in NC.

2.7  Conclusions  and  Open  Problems 

We have given the first polynomial-time polylog-approximation algorithms for minimizing the
maximum completion time for the problems of job shop scheduling, flow shop scheduling, dag
shop scheduling and a generalization of dag shop scheduling in which there are groups of identical
machines. Clearly the most basic question to be pursued is the development of approximation
algorithms with even better performance guarantees. It is our belief that the $O(\log p_{\max})$
factor that is introduced by the techniques of Section 2.2 can be improved upon, perhaps even by
a simple greedy method. However, such methods have proved frustratingly difficult to analyze.
The other logarithmic factor in the performance bound seems much more difficult to improve
upon.

An interesting consequence of our results is the following observation about the structure
of shop scheduling problems. Assume we have a set of jobs which need to run on a set of
machines. We know that any schedule for the associated open shop problem must be of length
$\Omega(\Pi_{\max} + P_{\max})$. Furthermore, we know that no matter what type of partial ordering we impose
on the operations of each job we can produce a schedule of length
$O\left((\Pi_{\max} + P_{\max}) \frac{\log^2(m\mu)}{\log\log(m\mu)}\right)$.
Hence for any instance of the open shop problem, we can impose an arbitrary partial order on
the operations of each job and increase the length of the optimal schedule by a factor of no
more than $O\left(\frac{\log^2(m\mu)}{\log\log(m\mu)}\right)$.

An interesting combinatorial question is "Can this imposition of a partial order really make
the optimal schedule that much longer than $O(\Pi_{\max} + P_{\max})$?" In other words, how good are
$\Pi_{\max}$ and $P_{\max}$ as lower bounds? We have seen that in two interesting special cases, job shop
scheduling with unit-length operations and open shop scheduling, there is a schedule of length
$O(\Pi_{\max} + P_{\max})$. Does there always exist an $O(\Pi_{\max} + P_{\max})$ schedule for the general job, flow
or dag shop scheduling problem?

Beyond  this,  there  are  a  number  of  interesting  questions  raised  by  this  work,  including 

• Do there exist parallel algorithms that achieve the approximations of our sequential
algorithms? For the general job shop problem this seems hard, since we rely heavily on
the algorithm of Sevast'yanov. For open shop scheduling, however, a simple sequential
algorithm achieves a factor of 2, whereas the best NC algorithm that we have achieves
only an $O(\log n)$-approximation. As a consequence of the results above, all one would
need to do is to produce any greedy schedule.

•  Are  there  simple  variants  of  the  greedy  algorithm  for  open  shop  scheduling  that  achieve 
better  performance  guarantees?  For  instance,  how  good  is  the  algorithm  that  always 
selects  the  job  with  the  maximum  total  (remaining)  processing  time? 

•  Our  algorithms,  while  polynomial-time  algorithms,  are  inefficient.  Are  there  significantly 
more  efficient  algorithms  which  have  the  same  performance  guarantees?  Recent  work 
by  Plotkin,  Shmoys  and  Tardos  [97]  gives  a  significantly  faster  algorithm,  which  does 
not  use  linear  programming,  to  accomplish  the  derandomization  described  in  Section  2.5. 
Therefore  the  major  remaining  problem  is  to  develop  a  faster  version  of  the  algorithm  of 
Sevast’yanov. 


Chapter  3 


Scheduling  Parallel  Machines 
On-line1 


3.1  Introduction 

The scheduling of a set of tasks on parallel machines (footnote 2) is a basic problem in combinatorial
optimization, with a number of increasingly important applications. As we discussed in Chapter 1,
there is a rich literature on parallel machine scheduling, but the overwhelming majority of these
results assume that a complete specification of the instance is available before the algorithm
begins to construct a schedule. This fails, however, to capture many of the scheduling problems
that arise in practice. Consider, for example, the allocation of jobs to the processing units of a
multiprocessor: the scheduler does not in advance have complete knowledge of a job's running
time, or of what jobs will be created and require processing in the future. Or, consider the
owner of a garage, who must schedule his group of car repairmen. The owner does not know
how many cars will arrive to be repaired on a given day, and also does not know how long it will
take to repair any particular car. In this chapter, we will study on-line algorithms - algorithms
that work without any clairvoyant assumptions - for the most basic types of parallel machine
scheduling models. Our algorithmic results are based on two rather general techniques that
allow us to convert algorithms that need more complete knowledge of the input data into ones
that need less advance knowledge.

1 This chapter describes joint work with David Shmoys and David Williamson [113].

2 See Chapter 1 for the definition of the parallel machine model and the associated notation.

When  on-line  scheduling  has  been  studied  in  the  past,  the  models  that  have  been  considered 
were  typically  of  the  following  form:  the  existence  of  a  job  is  unknown  until  a  certain  release 
date, at which point the processing requirement for that job is completely specified. We will
consider  more  realistic  models,  where  the  processing  requirement  of  a  job  is  also  unknown  when 
it  starts  processing,  and  can  only  be  determined  by  processing  the  job  and  observing  how  long 
it  takes  to  be  completed.  In  fact,  our  results  show  that  the  traditional  sort  of  on-line  scheduling 
problem  is  provably  not  much  harder  than  its  off-line  analogue,  whereas  the  lack  of  knowledge 
about  job  sizes  can  drastically  affect  the  quality  of  solutions  that  can  be  obtained. 

We shall evaluate on-line algorithms in terms of their competitive ratio [116]. Let $C^A_{\max}(\mathcal{I})$
be the makespan of a deterministic on-line algorithm $A$ on instance $\mathcal{I}$. Algorithm $A$ is said to
have competitive ratio $c$ (or is said to be $c$-competitive) if $C^A_{\max}(\mathcal{I}) \le c \cdot C^*_{\max}(\mathcal{I}) + O(1)$ for all
problem instances $\mathcal{I}$. If $A$ is a randomized algorithm, then $A$ is said to have competitive ratio $c$
(or is said to be $c$-competitive) if $E[C^A_{\max}(\mathcal{I})] \le c \cdot C^*_{\max}(\mathcal{I}) + O(1)$ for all instances $\mathcal{I}$, where the
expectation is taken over all random choices of the algorithm $A$. Although these notions apply
to algorithms without any restrictions on their running times, we will focus on polynomial-time
on-line algorithms, rather than the purely information-theoretic analogue. Nonetheless, our
lower bounds are based on information-theoretic arguments.

In  a  non-preemptive  model,  it  may  be  unrealistic  to  assume  that  once  a  job  is  started,  it 
must  be  run  until  its  (unknown)  completion  time,  without  any  form  of  recourse.  A  central 
aspect  of  our  models  is  that  we  introduce  the  notion  of  restarts :  a  job  may  be  canceled  and 
later  started  again,  but  it  is  started  again  from  scratch.  For  example,  in  the  uniformly  related 
machine  model,  we  may  wish  to  cancel  a  job  that  is  taking  longer  than  “anticipated”,  and  then 
start  it  again  on  a  faster  machine. 

The  results  of  this  chapter  are  as  follows.  We  introduce  two  general  techniques  to  convert 
off-line  algorithms  to  algorithms  that  require  less  initial  information.  Using  the  first  technique, 
we  show  that  we  can  focus  on  the  case  without  release  dates,  since  the  situation  in  which  there 
are  unknown  job  arrivals  and  unknown  processing  times  can  be  reduced,  with  only  a  factor  of 
2  increase  in  the  competitive  ratio,  to  one  in  which  there  are  only  unknown  processing  times. 
This result also holds when comparing a model in which there are only unknown arrivals and its
off-line  equivalent.  As  a  consequence,  we  consider  the  situation  in  which  all  jobs  (of  unknown 
size)  are  available  to  be  scheduled  from  the  start.  For  both  uniformly  related  and  unrelated 
machines,  we  use  our  second  technique  to  convert  off-line  algorithms  into  algorithms  that  need 
not  be  given  the  processing  time  of  each  job.  Nonetheless,  the  resulting  on-line  algorithms  do 
not  suffer  too  great  a  degradation  in  the  quality  of  the  solution  produced. 

It is quite simple to obtain tight bounds for the identical machine model: one of the oldest
results in scheduling theory is an on-line algorithm of Graham [52], which produces a schedule
of length at most $(2 - \frac{1}{m}) C^*_{\max}$; we give a straightforward proof that this is exactly the best
possible ratio. We also give an identical tight bound on the competitive ratio obtainable in
the preemptive variant. This has the important consequence that, although complexity theory
shows that there is a fundamental difference between the preemptive and non-preemptive models,
this difference disappears when scheduling jobs on-line. We also show that randomization
is of little help to the scheduler, proving that no randomized algorithm can achieve competitive
ratio better than $(2 - O(\frac{1}{\sqrt{m}}))$, even against an oblivious adversary. This result is in sharp
contrast to other recent work in on-line algorithms, in which randomness has been shown to
significantly increase the performance of the algorithms [71, 119].

We  then  show  that  on-line  scheduling  on  uniformly  related  machines  is  much  harder  than 
on  identical  machines.  This  is  also  quite  different  from  the  off-line  setting,  where  results  for 
identical  machines  have  typically  extended  to  the  case  where  machines  run  at  different  speeds.  In 
our  on-line  model,  we  show  that  this  generalization  does  make  the  problem  significantly  harder: 
we prove that the optimal competitive ratio is $\Theta(\log m)$. We generalize this model to unrelated
machines by assuming that for each job, the relative speeds of the machines are known, but
its size is unknown. In this setting, we can also obtain an on-line algorithm with an $O(\log n)$
competitive ratio. Once again, we also give identical results for the preemptive variants of these
models. For uniformly related machines, we also show how to take advantage of the situation
when the relative speeds of the machines are not too different; we give an $O(\log R)$-competitive
algorithm for the non-preemptive model, where R is the ratio of the fastest-to-slowest machine
speeds. Finally, we can show that this bound is tight, in the following sense: for any ratio of
machine speeds $R < m$ we prove a lower bound of $\Omega(\log R)$ on the competitive ratio of any
deterministic on-line scheduling algorithm.

3.1.1  Previous  Work  on  On-line  Algorithms 

On-line  algorithms  have  been  studied  for  a  variety  of  problem  domains.  Some  of  the  oldest  of 
these  results  are  for  the  bin-packing  problem,  where  the  quality  of  an  on-line  algorithm  was 
similarly  measured  by  the  worst-case  ratio  of  the  performance  of  the  on-line  algorithm  to  the 
best  off-line  algorithm.  The  idea  to  use  this  ratio  as  a  measure  of  the  quality  of  an  on-line 
algorithm  was  not  utilized  in  other  areas  of  computer  science  until  the  work  of  Sleator  and 
Tarjan,  who  studied  several  questions  in  paging  and  list  maintenance  [115].  These  problems  are 
inherently  on-line,  since  one  must  maintain  the  list  or  decide  which  pages  should  be  in  primary 
memory  without  knowing  what  the  future  sequence  of  requests  will  be.  Until  the  work  of  Sleator 
and  Tarjan,  people  typically  evaluated  strategies  for  these  problems  by  asymptotic  average-case 
analysis.  These  analyses,  however,  were  not  always  consistent  with  experimental  evidence  about 
the  performance  of  the  various  strategies.  The  idea  of  using  the  competitive  ratio  to  evaluate 
these  algorithms  proved  exciting,  since  in  certain  cases  their  results  lent  theoretical  support  to 
the  superiority  of  the  best  strategies  in  practice. 

The  subsequent  growth  of  work  on  on-line  algorithms  was  explosive.  An  attempt  to  provide 
a  general  theory  of  on-line  algorithms  was  made  by  Borodin,  Linial  and  Saks,  who  defined 
the  “metrical  task  system”  and  gave  tight  bounds  on  the  performance  of  algorithms  in  this 
framework  [17].  The  generality  of  the  characterization  led  to  pessimistic  worst-case  performance 
bounds  and  therefore  limited  its  usefulness. 

A less general framework that quickly gained a great deal of notoriety is the k-server problem,
introduced by Manasse, McGeoch and Sleator [87]. Given k “servers” in a metric space, the
k-server problem requires the service of a sequence of service requests, where a service request is a
point  in  the  metric  space  and  is  served  by  moving  a  server  to  that  point.  The  problem  is  on-line 
in  that  the  future  sequence  of  requests  is  unknown.  This  problem  generalizes  several  important 
problems,  including  caching,  paging  and  planning  the  motion  of  the  heads  of  a  two-headed  disk. 

The famous k-server conjecture is that there is a k-competitive algorithm for this problem;
it is known that no better competitive ratio is possible [87]. This conjecture has been proved to
be true in a number of special cases [87, 24, 23], and for the general case there exist algorithms
whose competitive ratios are known to depend only on k, albeit exponentially [41, 54].

This  flurry  of  work  has  motivated  another  stream  of  work  in  on-line  algorithms,  in  which 
people have studied how well one can solve basic problems in combinatorial optimization on-line.
Graph coloring [63, 119], matching [71], weighted matching [68], and the transportation
problem  [72]  have  all  been  considered  in  a  model  where  nodes  of  the  graph  (and  incident  edges) 
are  introduced  one  at  a  time  and  an  irrevocable  decision  must  be  made  about  the  color  of  the 
node  or  which  edge  should  be  in  the  matching.  A  variety  of  path-finding  problems  have  been 
considered  as  well,  where  one  tries  to  find  the  shortest  path  possible  in  an  environment  about 
which  one  does  not  have  total  information  [94,  29,  15]. 

The work in this chapter can be seen as in the spirit of both of these directions of work in
on-line  algorithms.  Scheduling  problems  are,  like  paging  and  caching,  basic  issues  in  computer 
control;  they  are  also  basic  questions  in  combinatorial  optimization. 

3.1.2  Previous  Work  on  On-line  Scheduling 

A  model  of  on-line  scheduling  that  is  very  different  from  ours  is  closely  related  to  a  variant  of  the 
bin-packing  problem.  When  the  number  of  bins  is  fixed,  on-line  bin-packing  can  be  interpreted 
as  a  type  of  on-line  scheduling,  where  the  jobs  are  given  in  a  list  and  scheduled  in  turn.  The  job 
currently  being  scheduled  is  completely  specified,  but  the  jobs  later  in  the  list  are  completely 
unknown.  This  model  corresponds  less  to  a  dynamic  environment  and  more  to  a  first  come  first 
serve  reservation  system  for  some  date  in  the  future.  It  also  bears  some  similarity  to  the  on-line 
graph  problems  mentioned  above.  Recently  Johnson  [65]  and  Chandrasekaran  and  Narayanan 
[91]  have  proved  some  lower  bounds  in  this  model. 

In  terms  of  previous  work  on  our  model  of  on-line  scheduling,  some  attention  has  been 
given  in  the  past  to  the  question  of  unknown  release  dates.  In  the  preemptive  model,  Gonzales 
and  Johnson  gave  a  polynomial  time  algorithm  that  optimally  solves  this  problem  on  identical 
machines.  Sahni  and  Cho  extended  this  result  to  apply  to  uniformly  related  machines.  In  the 
non-preemptive  model  Gusfield  considered  a  more  general  problem  on  identical  machines,  in 
which  each  job  has  an  associated  due  date,  and  the  goal  is  to  minimize  the  maximum  lateness. 
He proved a bound of $(2 - \frac{1}{m}) p_{\max}$ on the difference between the maximum lateness produced
by  an  on-line  heuristic  and  the  minimum  possible  maximum  lateness. 

Relatively little work has been done on the model of on-line scheduling that we consider.
Chandra,  Karloff,  and  Vishwanathan  [21]  proposed  studying  on-line  scheduling  with  unknown 
processing  times,  and  analyzed  the  problem  of  minimizing  the  average  completion  time  on  a 
single  machine  with  preemption.  In  addition  to  the  algorithms  for  identical  machines  given  by 
Graham  [52],  the  only  other  work  for  parallel  machines  known  to  us  prior  to  ours  is  that  of  Jaffe 
[64]  and  Davis  and  Jaffe  [28].  Davis  and  Jaffe  show  that  in  a  restricted  model  without  restarts, 
an  on-line  algorithm  for  non-preemptive  scheduling  of  uniformly  related  machines  cannot  have 
competitive ratio better than $\Omega(\sqrt{m})$. Jaffe gives an algorithm for this case with competitive
ratio $O(\sqrt{m})$.

Recently  Feldmann,  Sgall  and  Teng  [38]  studied  on-line  scheduling  on  a  mesh  of  identical 
processors,  where  one  must  allocate  a  submesh  of  a  specified  size  to  a  job,  when  the  processing 
time of that job is unknown. They prove an $O(\sqrt{\log \log m})$ bound on the competitive ratio in
this  model.  In  addition,  they  study  a  number  of  other  architectures  such  as  hypercubes  and 
trees. 

The  rest  of  this  chapter  is  organized  as  follows.  In  Section  3.2  we  show  that  the  introduction 
of  unknown  release  dates  into  a  scheduling  problem  does  not  make  the  problem  too  much  harder. 
As  a  result  we  concentrate  on  the  situation  where  all  the  jobs  are  available  at  time  0  but  have 
unknown  processing  requirements.  In  Section  3.3  we  present  our  on-line  scheduling  algorithms 
for  the  various  parallel  machine  models,  and  in  Section  3.4  we  give  the  corresponding  lower 
bounds.  In  Section  3.5  we  discuss  on-line  scheduling  in  several  other  scheduling  models,  and 
we  suggest  some  further  directions  for  research  in  Section  3.6. 

3.2  Unknown  Release  Dates 

Our  model  of  on-line  scheduling  includes  both  unknown  release  dates  for  jobs  and  unknown  job 
sizes.  In  this  section  we  will  show  that,  with  respect  to  minimizing  schedule  length,  the  first 
element  is  much  less  important  than  the  second  element.  We  will  show  that  if  the  release  dates 
are  unknown,  then  we  can  assume  that  all  jobs  are  always  available,  and  repeatedly  use  an 
algorithm  that  works  in  this  environment;  the  feasible  schedules  produced  by  this  simulation 
are of only somewhat lesser quality than can be obtained in the special case. This result does
not depend on the remaining specifics of the scheduling environment; in particular, it allows
us  to  use  off-line  algorithms  to  obtain  algorithms  that  can  handle  unknown  release  dates  (but 
where  the  processing  times  are  known  once  released),  as  well  as  allowing  us  to  focus  on  on-line 
algorithms  in  the  case  when  all  jobs  are  released  at  time  0. 

Theorem 3.2.1 Let A be a polynomial-time scheduling algorithm which works in an environment
in which each job to be scheduled is available at time 0 and always produces a schedule of length
at most $\rho C^*_{\max}$. For the analogous environment in which the existence of a job is unknown until
its release date, there exists another polynomial-time algorithm A' that works in this more general
setting, and produces a schedule of length at most $2\rho C^*_{\max}$.

Proof: Let $\mathcal{I}$ be an instance including jobs with unknown release dates, and let $S_0$ be the set
of jobs available at time 0. The scheduler applies algorithm A and schedules the jobs in $S_0$,
finishing at time $F_0$. Let $S_1$ be the set of jobs released in the interval $(0, F_0]$. The scheduler now, at
time $F_0$, applies algorithm A to schedule $S_1$, finishing at time $F_1$. In general let $S_{i+1}$ be the set
of jobs released in $(F_{i-1}, F_i]$, and let $F_i$ be the point in time when the schedule for $S_i$ completes.
At time $F_i$, the scheduler uses algorithm A to schedule the jobs in $S_{i+1}$. Let $F_k$ be the finishing
point of the entire schedule. (See Figure 3.1.)

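To analyze the length of the resulting schedule, consider the modified problem instance $\mathcal{I}'$
where the jobs in $S_k$ are released at time $F_{k-2}$. Since these jobs are released at an earlier point
in time in $\mathcal{I}'$ than in $\mathcal{I}$, certainly $C^*_{\max}(\mathcal{I}') \le C^*_{\max}(\mathcal{I})$.

Now note that $F_{k-2} + F_k - F_{k-1} \le \rho C^*_{\max}(\mathcal{I}')$, since the jobs in $S_k$ are not released until $F_{k-2}$
and the properties of algorithm A guarantee that $F_k - F_{k-1}$ is within a factor of $\rho$ of the shortest
schedule for $S_k$. Similarly, $F_{k-1} - F_{k-2} \le \rho C^*_{\max}(\mathcal{I})$. Adding the two inequalities gives
$F_k \le \rho C^*_{\max}(\mathcal{I}') + \rho C^*_{\max}(\mathcal{I}) \le 2\rho C^*_{\max}(\mathcal{I})$.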
■ 
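
The batching construction in the proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `schedule_with_release_dates` and `lpt_makespan` are hypothetical, the off-line algorithm A is represented by a function that returns the length of the schedule it builds, and a simple longest-processing-time rule on two identical machines stands in for A. The proof does not say what to do when a batch is empty; the sketch assumes the scheduler idles until the next release.

```python
def lpt_makespan(sizes, m=2):
    """Stand-in for off-line algorithm A (an assumption, not the thesis's A):
    longest-processing-time-first on m identical machines."""
    loads = [0] * m
    for p in sorted(sizes, reverse=True):
        i = loads.index(min(loads))  # least loaded machine
        loads[i] += p
    return max(loads)


def schedule_with_release_dates(jobs, offline_makespan):
    """jobs: list of (release_date, size) pairs.
    Returns the finishing times F_0, F_1, ... of the successive batches."""
    pending = sorted(jobs, key=lambda job: job[0])
    now = 0
    idx = 0
    finish_times = []
    while idx < len(pending):
        # Batch S_i: every job released up to the current time.
        batch = []
        while idx < len(pending) and pending[idx][0] <= now:
            batch.append(pending[idx][1])
            idx += 1
        if not batch:
            now = pending[idx][0]  # assumption: idle until the next release
            continue
        now += offline_makespan(batch)  # run A on the batch as if released now
        finish_times.append(now)
    return finish_times


print(schedule_with_release_dates(
    [(0, 3), (0, 3), (0, 2), (2, 4), (5, 1)], lpt_makespan))
```

Here the first batch contains the three jobs available at time 0, and every job released while that batch runs is deferred to the second batch, exactly as in the definition of the sets $S_i$.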

[Figure 3.1: Using an algorithm for a scheduling environment without release dates to schedule
in an environment with release dates. The timeline starts at time 0 with the schedule for the
$S_0$ jobs, followed by the schedule for the $S_1$ jobs; the $S_1$ jobs are released during the $S_0$
schedule and the $S_2$ jobs during the $S_1$ schedule.]

This theorem is very general, in that it can be applied to a number of different types of
scheduling scenarios. In particular, it shows that to produce an on-line algorithm for our full
on-line model, we can modify an algorithm for the case in which all jobs are available at time
0 and processing times are unknown, increasing the competitive ratio of the algorithm by only
a factor of 2. Further, the theorem applies not only to problems of parallel machine scheduling
but also to the entire class of shop scheduling problems, including open shop, flow shop and job
shop [112]. In addition it applies to the scheduling model of Feldmann, Sgall and Teng mentioned
in Section 3.1.2. They studied the on-line allocation of submeshes of a large mesh to different
jobs, but their algorithms only worked when all jobs were available at time 0. Our theorem
generalizes their result to an $O(\sqrt{\log \log m})$-competitive on-line algorithm even when jobs have unknown
release dates. Finally, our theorem yields the following corollary:

Corollary  3.2.2  If  job  release  dates  are  unknown,  but  once  a  job  arrives  its  size  is  known,  there 
is  a  polynomial-time  on-line  algorithm  for  scheduling  uniformly  related  machines  that  comes  within 
a factor of $(2 + \epsilon)$ of optimal and a polynomial-time on-line algorithm for scheduling unrelated
machines  that  comes  within  a  factor  of  4  of  optimal. 

Proof:  Direct  from  the  theorem  and  previously  known  results  on  the  approximability  of  these 
problems  [61,  83]. 

For identical machines this result yields a $(2 + \epsilon)$-approximation algorithm; however, something
slightly better was already known. In 1966 Graham showed that list scheduling was
a $(2 - \frac{1}{m})$-approximation algorithm for scheduling identical machines. In list scheduling, the
scheduler takes any list of jobs and, whenever a machine becomes available, places the next job
on the list on that machine. It is not hard to see that if this strategy is extended so that newly
arriving jobs are added to the end of the list, then list scheduling is a $(2 - \frac{1}{m})$-approximation
algorithm for scheduling identical machines with release dates. See, for example, [56].
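
A minimal sketch of list scheduling with arrivals appended to the end of the list may make the rule concrete. The function name `list_schedule` and the representation of jobs as (release date, size) pairs are assumptions made for illustration; the list order is taken to be the order of release, with ties left in input order.

```python
import heapq


def list_schedule(jobs, m):
    """jobs: list of (release_date, size) pairs; m identical machines.
    Returns the makespan of the list schedule."""
    # Newly arriving jobs join the end of the list, so the list is in release order.
    ordered = sorted(jobs, key=lambda job: job[0])
    free = [0] * m  # times at which each machine next becomes idle
    heapq.heapify(free)
    makespan = 0
    for release, size in ordered:
        idle_at = heapq.heappop(free)   # earliest available machine
        start = max(idle_at, release)   # a job cannot start before its release
        finish = start + size
        heapq.heappush(free, finish)
        makespan = max(makespan, finish)
    return makespan


print(list_schedule([(0, 3), (0, 3), (0, 2), (2, 4)], m=2))
```

Note that the rule never inspects job sizes when choosing what to run next; sizes appear in the sketch only to simulate when each job completes.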

Despite  the  fact  that  unknown  release  dates  do  not  make  a  scheduling  problem  much  more 
difficult,  we  can  show  that  they  sometimes  do  make  it  more  difficult  to  schedule  machines 
near-optimally. 

Theorem  3.2.3  There  is  no  on-line  algorithm  for  non-preemptive  scheduling  of  identical  machines 
with  unknown  release  dates  but  known  processing  requirements  with  competitive  ratio  better  than 
10/9,  even  if  restarts  are  allowed. 

It  is  interesting  to  note  that  this  is  not  the  case  when  preemption  is  allowed;  Sahni  and  Cho 
have  shown  that  there  is  a  polynomial-time  algorithm  to  solve  that  problem  optimally  [103]. 
Intuitively  this  is  not  surprising,  since  preemption  allows  you  to  adjust  to  new  information 
without  losing  work  done  beforehand. 

Proof: Consider the machine environment that consists of two identical machines. At time 0
two jobs of size 3 (A and B) and one job of size 2 (C) are released. We consider several
cases.

•  The scheduler initially schedules A on machine 1 and B on machine 2 at time 0, and
restarts neither until (possibly) time 2. In this case the adversary introduces a job (D)
of size 4 at time 2. If the scheduler does not interrupt A or B and start D at time 2 the
minimum schedule length will be 7; if the scheduler does interrupt one to run job D then
the minimum schedule is of length 8. The optimal schedule would have been of length 6:
C and D on machine 1 and A and B on machine 2. In this case the performance ratio is
at least 7/6.

•  The scheduler initially schedules A on machine 1 and B on machine 2, and interrupts at
least one of them at time 1. At time 2 the adversary releases a job of size 2; the optimal
schedule for this example is of length 5, but we have forced a schedule of length 6, therefore
the performance ratio is at least 6/5.

•  The scheduler only schedules one job to begin at time 0. In this case the adversary
introduces a job of size 2 at time 1; the resulting performance ratio is at least 6/5.

•  The scheduler schedules A and C at time 0. The adversary then introduces a job of size
7 at time 1. If the scheduler does not interrupt A or C at time 1, the minimum length
of the produced schedule is 9, whereas the optimal schedule would be of length 8; hence
the performance bound is at least 9/8. If the scheduler interrupts A or C at time 1, the
adversary introduces a job of size 3 at time 3. The optimal schedule for this instance is of length 9,
and the algorithm produces a schedule of length at least 10. Therefore the performance
bound in this case, and in general, is at least 10/9. ■
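
The optimal schedule lengths quoted in the cases above are easy to confirm by exhaustive search on these tiny instances. The sketch below is an illustrative check, not part of the proof; `optimal_makespan` is a hypothetical helper that tries every job order and machine assignment, starting each job as early as its release date and its machine allow.

```python
from itertools import permutations, product


def optimal_makespan(jobs, machines=2):
    """jobs: list of (release_date, size). Brute-force optimal non-preemptive
    makespan on identical machines (only sensible for a handful of jobs)."""
    n = len(jobs)
    best = float("inf")
    for order in permutations(range(n)):
        for assign in product(range(machines), repeat=n):
            free = [0] * machines
            for j in order:
                release, size = jobs[j]
                k = assign[j]
                free[k] = max(free[k], release) + size
            best = min(best, max(free))
    return best


base = [(0, 3), (0, 3), (0, 2)]                 # jobs A, B, C
print(optimal_makespan(base + [(2, 4)]))        # first case
print(optimal_makespan(base + [(2, 2)]))        # second case
print(optimal_makespan(base + [(1, 2)]))        # third case
print(optimal_makespan(base + [(1, 7)]))        # fourth case, no interruption
print(optimal_makespan(base + [(1, 7), (3, 3)]))  # fourth case, after interruption
```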

In  light  of  the  results  in  this  section,  for  the  remainder  of  this  chapter  we  shall  focus  on 
scheduling  environments  in  which  all  jobs  are  available  to  be  scheduled  at  time  0. 


3.3  Algorithms  for  On-Line  Scheduling 

In this section we will present on-line scheduling algorithms for the basic parallel machine models.
We first note that in the case of identical machines, the well-known list scheduling algorithm
of Graham [52] always comes within a factor of $(2 - \frac{1}{m})$ of the optimal length schedule, and
comes within the same bound of the optimal preemptive schedule length. Since list scheduling
does not depend on the sizes of the jobs, list scheduling is an on-line algorithm with a $(2 - \frac{1}{m})$
competitive ratio.

Theorem  3.3.1  [Graham]  There  is  an  on-line  algorithm  for  scheduling  identical  machines  that 
achieves competitive ratio $(2 - \frac{1}{m})$ in both the preemptive and non-preemptive models.

For the other machine models, we will present a general technique which yields an $O(\log n)$-competitive
algorithm for each of them. We will then show how to convert this general algorithm
to an $O(\log m)$-competitive algorithm for both preemptive and non-preemptive uniformly
related machines. We will also present an algorithm for non-preemptive uniformly related machines
which has a competitive ratio of $O(\min(\log m, \log(s_1/s_m)))$, assuming $s_1 \ge s_2 \ge \cdots \ge s_m$.

3.3.1  The General Technique

Our general algorithm depends on the existence of either polynomial-time algorithms or polynomial-time
$\rho$-approximation algorithms for scheduling in the various machine models.

Theorem 3.3.2 Suppose that there is a $\rho$-approximation algorithm A for the [non-preemptive/preemptive]
[uniformly related/unrelated] machine problem, and let $\mathcal{I}$ be an instance of this problem. Then there
is an on-line scheduling algorithm which produces a schedule no longer than
$(4\rho \log n + 4\rho \log 2\rho + 1) C^*_{\max}(\mathcal{I})$.

Proof:  The  on-line  algorithm  works  by  repeatedly  applying  algorithm  A  to  the  jobs,  after 
guessing the size of each job. Given a schedule produced by the algorithm A, our on-line
algorithm  will  run  each  job  at  the  particular  time  interval  and  on  the  particular  machine 
specified  by  the  schedule.  In  the  preemptive  model,  the  job  may  not  have  finished  all  its 
processing  by  the  end  of  the  schedule,  in  which  case  we  preempt  the  job.  In  the  non-preemptive 
model,  we  cancel  the  job  if  it  is  not  completely  processed  in  the  time  allotted.  In  either  case, 
if  the  job  does  not  complete  we  will  be  able  to  update  our  estimate  of  the  size  of  that  job. 

For the sake of simplicity, we will assume that the data is normalized so that the fastest
machine for each job $J_j$ has speed $s_{ij} = 1$. One result of this assumption is that any job of size
$p_j$ takes time $p_j$ to complete on the machine that processes it fastest.

The  complete  on-line  algorithm  is  below. 

Step 1 Pick any job $J_{j'}$ and run it to completion on a machine $m_{i'}$ such that $s_{i'j'} = 1$. Let the
time that this takes be denoted by $\Delta$.

Step 2 Let $q = \Delta/(\rho n)$.

Step 3 Use algorithm A to construct a schedule for all jobs that have not yet completed,
setting $p_j \leftarrow q$ for all remaining jobs $J_j$. Run the jobs in this schedule, preempting or
canceling all jobs that do not complete in the time allotted to them by the schedule.

Step 4 If any jobs have not yet completed, set $q \leftarrow 2q$ and go to Step 3.
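
The four steps can be simulated as follows for uniformly related machines in the non-preemptive model with restarts. This is a sketch under assumptions, not the implementation of the thesis: `online_doubling` is a hypothetical name, the true sizes are used only to decide whether a job completes within its allotted slot, the speeds are assumed to be normalized so that the fastest machine has speed 1, and a greedy earliest-completion rule stands in for algorithm A, with `rho` supplied by the caller as an assumed guarantee for that stand-in.

```python
def online_doubling(true_sizes, speeds, rho=1.0):
    """Simulate Steps 1-4. true_sizes are hidden from the scheduling decisions;
    speeds must be normalized so that max(speeds) == 1.
    Returns (total schedule length, number of iterations of Step 3)."""
    assert max(speeds) == 1
    n, m = len(true_sizes), len(speeds)

    # Step 1: run one job to completion on a speed-1 machine.
    delta = true_sizes[0]
    remaining = list(true_sizes[1:])
    total = delta

    # Step 2: initial guess for every job size.
    q = delta / (rho * n)

    rounds = 0
    while remaining:
        # Step 3: schedule all unfinished jobs as if each had size q.
        # Greedy stand-in for A: put each job where it would finish earliest.
        finish = [0.0] * m
        for _ in remaining:
            i = min(range(m), key=lambda k: finish[k] + q / speeds[k])
            finish[i] += q / speeds[i]
        total += max(finish)
        # A job given q units of processing completes iff its true size is <= q;
        # the others are canceled and will be restarted from scratch.
        remaining = [p for p in remaining if p > q]
        # Step 4: double the guess and repeat.
        q *= 2
        rounds += 1
    return total, rounds


print(online_doubling([4, 1, 2, 8], speeds=[1, 0.5]))
```

On this small instance the guess q takes the values 1, 2, 4 and 8, and each iteration either completes a job or doubles the estimate for the jobs that remain.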

Let $C^*_{\max}$ be the length of the optimal schedule. We will now analyze the algorithm and the
length of the schedule it produces. First, in Step 1, the time $\Delta$ taken by job $J_{j'}$ on machine
$m_{i'}$ is at most $C^*_{\max}$, since the optimal schedule can be no shorter than the time taken by any
job running on the machine which processes it the fastest.

Next, we show that the first iteration of Step 3 produces a schedule no longer than $\Delta \le C^*_{\max}$.
One way to construct a schedule is to assign each of the n jobs to the machine that processes
it the fastest. In the worst case, all n jobs would be assigned to the same machine, and this
schedule would have length $nq = \Delta/\rho$. Since the schedule produced by algorithm A is no longer
than $\rho$ times optimal, it must produce a schedule of length no longer than $\Delta \le C^*_{\max}$.

In addition, future iterations of Step 3 must produce schedules of length no longer than
$2\rho C^*_{\max}$. Suppose the algorithm performs an iteration of Step 3 in which the jobs are assigned
size q. Since the algorithm did not finish processing all jobs in the previous iteration, we know
that the instance being scheduled must have some jobs $J_j$ such that $p_j > q/2$. An optimal
schedule for this subset of jobs must take time no greater than $C^*_{\max}$. Ensuring that each of
these jobs gets processed for q units can increase the length of the optimal schedule for these
jobs by no more than a factor of 2. Finally, the algorithm A will find a schedule for these jobs
that is no more than $\rho$ times as long as the optimal schedule, so that the schedule can be no
longer than $2\rho C^*_{\max}$.

To derive our $O((\log n) C^*_{\max})$ bound on the length of the schedule, we show that we essentially
need to consider only the last $\log(2\rho n)$ iterations of Step 3.

Lemma 3.3.3 Suppose there are f iterations of Step 3. Then the length of the schedule produced
in iteration $f - i$ is at least $2^\ell$ times as long as the length of the schedule produced in iteration
$f - i - \ell \log(2\rho n)$.

Proof: Assume that the (estimated) job size in iteration $f - i$ is q; then the (estimated) job size
in iteration $f - i - \ell \log(2\rho n)$ is $q/(2\rho n)^\ell$. If a job with size q exists, then the schedule containing
it must take time at least q. As we showed earlier, a schedule produced by algorithm A for jobs
with size $q/(2\rho n)^\ell$ has length at most $\rho n q/(2\rho n)^\ell$. Thus the length of the schedule produced
when the job size is $q/(2\rho n)^\ell$ is at most $(1/2)^\ell$ times the length of the schedule produced when
the job size is q. ■

Since every $\log(2\rho n)$ iterations the length of the schedule produced doubles, we can “charge”
iterations $f - i - \ell \log(2\rho n)$, $1 \le \ell \le (f - i)/\log(2\rho n)$, to iteration $f - i$. Since each of the
last $\log(2\rho n)$ iterations has length no longer than $2\rho C^*_{\max}$, and each gets charged no more than
$\frac{1}{2} + \frac{1}{4} + \cdots + (\frac{1}{2})^{f/\log(2\rho n)}$ times its length, the overall length of the schedule is at most

$$\Delta + \log(2\rho n) \cdot 2\rho C^*_{\max} \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots + (\tfrac{1}{2})^{f/\log(2\rho n)}\right) \le 4\rho \log n \, C^*_{\max} + 4\rho \log 2\rho \, C^*_{\max} + C^*_{\max},$$

which is $O((\log n) C^*_{\max})$. ■

Since  there  exists  a  polynomial-time  algorithm  for  the  optimal  scheduling  of  preemptive 
unrelated machines due to Lawler and Labetoulle [79], and there exists a 2-approximation algorithm
for scheduling non-preemptive unrelated machines due to Lenstra, Shmoys, and Tardos
[83],  we  have  the  following  corollaries. 

Corollary  3.3.4  There  is  an  on-line  algorithm  for  scheduling  preemptive  unrelated  machines  that 
has  competitive  ratio  4(logn)  +  5. 

Corollary  3.3.5  There  is  an  on-line  algorithm  for  scheduling  non-preemptive  unrelated  machines 
that  has  competitive  ratio  8(logn)  +  17. 
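
As a quick check of the constants, substituting $\rho = 1$ and $\rho = 2$ into the bound $4\rho \log n + 4\rho \log 2\rho + 1$ of Theorem 3.3.2 reproduces the two corollaries. The computation assumes logarithms are taken base 2, which is what the stated constants require.

```latex
\begin{align*}
\rho = 1:&\quad 4\log n + 4\log 2 + 1 = 4\log n + 4 + 1 = 4\log n + 5,\\
\rho = 2:&\quad 8\log n + 8\log 4 + 1 = 8\log n + 16 + 1 = 8\log n + 17.
\end{align*}
```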

We  can  do  somewhat  better  with  uniformly  related  machines.  As  the  following  lemma 
shows,  by  applying  a  list  scheduling  algorithm  until  there  are  at  most  m  unfinished  jobs  we  can 
quite  easily  reduce  the  number  of  jobs  from  n  to  m. 

Lemma  3.3.6  The  number  of  jobs  in  any  uniformly  related  machine  problem  instance  can  be 
reduced  on-line  from  n  to  m,  while  increasing  the  competitive  ratio  by  1. 

Proof:  We  simply  place  jobs  on  machines  arbitrarily.  Whenever  a  job  completes  and  a  machine 
falls  idle,  we  assign  it  a  new,  unprocessed  job.  When  no  new  jobs  are  available,  at  most  m  jobs 
have  not  yet  finished  processing.  Furthermore,  the  length  of  the  schedule  to  this  point  in  time 
can be no greater than $\sum_{j=1}^{n} p_j / \sum_{i=1}^{m} s_i$, which is a lower bound on the length of the optimal
preemptive  and  non-preemptive  schedules.  ■ 

Since  there  is  a  polynomial-time  algorithm  for  preemptive  uniformly  related  machines  due 
to Horvath, Lam, and Sethi [62], and also a $(1 + \epsilon)$-approximation algorithm for non-preemptive
uniformly  related  machines  due  to  Hochbaum  and  Shmoys  [61],  the  preceding  theorem  and 
lemma  yield  the  following  corollaries. 

Corollary 3.3.7 There is an on-line algorithm for scheduling preemptive uniformly related machines
that has competitive ratio $4(\log m) + 6$.

Corollary  3.3.8  There  is  an  on-line  algorithm  for  scheduling  non-preemptive  uniformly  related 
machines that has competitive ratio $(4 + \epsilon)(\log m) + (4 + \epsilon) \log(2 + \epsilon) + 2$.

How many preemptions/restarts does this general algorithm perform? In each iteration it
is possible that no job finished and therefore there are n preemptions/restarts at the end of
the iteration. If $p_{\min}$ is the minimum job size and $r = p_{\max}/p_{\min}$, the on-line algorithm does
$O(\log(r\rho n))$ iterations. This is because the $\Delta$ established in Step 1 can be no smaller than $p_{\min}$;
we then set $q = \Delta/(\rho n)$ and successively double it until we reach $p_{\max}$. Therefore the algorithms
for unrelated machines do $O(n \log(n\rho r))$ preemptions/restarts, and those for uniformly related
machines do $O(m \log(m\rho r))$.

3.3.2 An Improved Algorithm for Non-Preemptive Uniformly Related Machines

In the case of non-preemptive uniformly related machines, we can obtain an even better bound
when the ratio between the speeds of the fastest and slowest machines is less than m. Let $R =
s_1/s_m$. We will give an algorithm with competitive ratio $O(\min(\log R, \log m))$. This algorithm
uses a new and simple off-line 2-approximation algorithm for uniformly related machines.

A  Simple  (Off-Line)  2-Relaxed  Decision  Procedure  for  Uniformly  Related  Machines 

First we give a new (off-line) 2-relaxed decision procedure for uniformly related machines that
will be the basis of our on-line algorithm. The notion of a $\rho$-relaxed decision procedure was
used by Hochbaum and Shmoys [60]: given a deadline d, such a procedure either produces a
schedule of length $\rho d$ or verifies that there exists no schedule of length d. By using binary
search, a $\rho$-relaxed decision procedure can be converted into a $\rho$-approximation algorithm.
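The binary-search conversion can be sketched as follows. This is a minimal illustration and not the construction of [60]: the names `approximate_by_binary_search` and `relaxed_decide` are hypothetical, and the sketch assumes integer deadlines, that `lo` is a lower bound on the optimal length, and that the procedure succeeds at `hi`.

```python
from typing import Callable, Optional, Sequence

Schedule = Sequence[Sequence[int]]  # hypothetical: job indices per machine


def approximate_by_binary_search(
    relaxed_decide: Callable[[int], Optional[Schedule]],
    lo: int,
    hi: int,
) -> Optional[Schedule]:
    """Binary search over integer deadlines d in [lo, hi].

    relaxed_decide(d) is assumed to return a schedule of length at most
    rho * d, or None once it has verified that no schedule of length d exists.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        schedule = relaxed_decide(mid)
        if schedule is None:
            lo = mid + 1       # verified: the optimum exceeds mid
        else:
            best = schedule    # length at most rho * mid
            hi = mid - 1
    return best
```

Every deadline discarded from below was verified infeasible, so the last successful deadline is no larger than the optimal length, and the returned schedule is within a factor $\rho$ of optimal under the stated assumptions.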

The 2-relaxed decision procedure is as follows. Each machine has an associated queue. Each
job j is placed into the queue of the slowest machine $m_k$ such that $p_j \le s_k d$; that is, the slowest
machine that can complete the job within the given deadline. If for some job there is no such
machine, it is clear that there does not exist a schedule of length d. To construct a schedule,
whenever a machine is idle, it starts processing a new, unprocessed job from its queue. If
a machine’s queue is empty, it takes a job to process from the queue of the fastest machine
that is slower than it and that has a nonempty queue. If all such queues are empty, then the
machine remains idle. If the schedule constructed has $C_{\max} > 2d$, output no. Otherwise we
have produced a schedule of length at most 2d.

In order to prove that this is a 2-relaxed decision procedure, we must prove that when the
procedure outputs no there is no schedule of length d. Consider a job j that was not finished
by time 2d. Since jobs are only processed by machines on which they take at most d units of
time, this job must have started after time d; thus it was on the queue of some machine $m_k$
until time d. This implies that until time d machines $m_1, \ldots, m_k$ were all busy processing jobs
that could not have completed within time d on machines $m_{k+1}, \ldots, m_m$. Therefore in a schedule of length d
it is impossible to process all of these jobs and job j, and so there exists no schedule of length
d.
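The off-line procedure described above can be sketched as a small event-driven simulation. This is an illustrative sketch rather than the thesis's own code: the function name `two_relaxed_decision`, the return format, and the tie-breaking by machine index are assumptions, and speeds are assumed to be given from fastest to slowest.

```python
import heapq
from collections import deque


def two_relaxed_decision(sizes, speeds, d):
    """Return (makespan, assignment) with makespan <= 2*d, or None for "no".

    sizes[j] is p_j; speeds must be sorted from fastest to slowest.
    assignment[i] lists (job, start_time) pairs run on machine i.
    """
    m = len(speeds)
    queues = [deque() for _ in range(m)]
    for j, p in enumerate(sizes):
        # Slowest machine that can finish job j within the deadline d.
        fits = [i for i in range(m) if p <= speeds[i] * d]
        if not fits:
            return None            # no machine can run job j within d
        queues[max(fits)].append(j)

    assignment = [[] for _ in range(m)]
    idle = [(0.0, i) for i in range(m)]   # (time machine falls idle, machine)
    heapq.heapify(idle)
    makespan = 0.0
    while idle:
        t, i = heapq.heappop(idle)
        # Own queue first, then the nearest slower machine with a nonempty queue.
        source = next((k for k in range(i, m) if queues[k]), None)
        if source is None:
            continue               # nothing left that machine i may take
        j = queues[source].popleft()
        finish = t + sizes[j] / speeds[i]
        assignment[i].append((j, t))
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, i))
    return (makespan, assignment) if makespan <= 2 * d else None
```

For example, with speeds `[2, 1]`, sizes `[2, 1, 1]` and `d = 1`, the fast machine runs the size-2 job and then takes the second unit job from the slow machine's queue.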

The  On-line  Algorithm  for  Non-preemptive  Uniformly  Related  Machines 

In this section we will first give an $O(\log R)$-competitive on-line algorithm for non-preemptive
uniformly related machines. We will then prove that any problem instance can be reduced,
on-line, to an instance where R < m while only causing a slight increase in the competitive
ratio of an on-line algorithm on that instance. Hence we will have an on-line algorithm for all
instances with competitive ratio $O(\min(\log R, \log m))$.

First  we  present  the  main  algorithm. 

Theorem 3.3.9 Let I be an instance of the scheduling problem for non-preemptive uniformly
related machines. Then there is an on-line scheduling algorithm which produces a schedule no longer
than $16(\log R)C^*_{\max}$.

Proof: We round machine speeds down to the nearest power of two: when a machine finishes
processing a job it pretends to keep processing it long enough so that it seems to have been
processed at the lesser speed. When we interpret the schedule for this rounded problem instance
as a schedule for the actual problem instance, the competitive ratio can be increased by at
most a factor of two. Since the $s_i$ are all powers of two, and all the $s_i$ are within a factor
of R of $s_1$, it immediately follows that there are at most log R different machine speeds. Let
$M_1 = \{m_i \mid s_i = s_1\}$, $M_2 = \{m_i \mid s_i = s_1/2\}$, ..., $M_{\log R} = \{m_i \mid s_i = s_1/2^{\log R}\}$.

Our strategy will be to first convert the off-line 2-relaxed decision procedure into an on-line
2 log R-relaxed decision procedure, and then build from that an on-line algorithm. The off-line
decision procedure does not immediately lend itself to an on-line procedure, since the criterion
it uses to assign jobs to machine queues utilizes knowledge of the job sizes. To convert this
to an on-line decision procedure we will repeatedly run the off-line relaxed decision procedure
to either schedule a job or else update the estimate of its size. Note that given the rounded
machine speeds, instead of queueing jobs on machines $m_1, \ldots, m_m$, we can instead queue jobs
on sets of machines $M_1, \ldots, M_{\log R}$.

A formal description of an on-line 2 log R-relaxed decision procedure is as follows. The
procedure either outputs no if there is no schedule of length d or it produces a schedule of
length 2d log R. Note that even if it answers no the procedure may have completely processed
some of the jobs in that time.

Input  A  set  of  jobs  and  a  deadline  d. 

Step 0 Put all jobs into the $M_{\log R}$ queue.

Step  1  Run  the  off-line  2-relaxed  decision  procedure,  with  the  modification  that  no  jobs  are 
started  after  time  d  (that  is,  when  a  machine  is  idle  it  takes  a  job  to  process  off  of  its 
queue,  or,  when  its  queue  is  empty,  off  of  the  first  slower  machine  that  has  a  non-empty 
queue;  etc.) 

Step 2 1. If all jobs finish processing by time 2d, stop.

2. If any machine in $M_1$ is still processing a job at time 2d then there is no schedule of
length d. Output no; return.

3. If any set of machines $M_k$ has a job j in its queue at time d then there is no schedule
of length d. Output no; return.

4. If there are jobs that are being processed at time 2d on machines in $M_i$, $i > 1$,
cancel these jobs and put them on the queue of $M_{i-1}$. Go to Step 1.

To prove that this is an on-line 2 log R-relaxed decision procedure, notice first that the length
of the schedule or partial schedule produced by this procedure is no longer than 2d log R, since
the off-line relaxed decision procedure produces a schedule of length at most 2d and it is run at
most log R times. Furthermore, despite the fact that the $p_j$ are unknown, the on-line relaxed
decision procedure maintains the invariant that a job is only on the queue of $M_k$ if it could not
complete in time d on any machine in $M_{k+1}, \ldots, M_{\log R}$. This is certainly true initially, since
all the jobs are put in the queue of $M_{\log R}$. Furthermore, since the procedure does not start
new jobs after time d, any job that is still being processed at time 2d on some machine in $M_i$
must take more than time d to process on any machine in $M_i$. Therefore any such job does not
belong on the queue of $M_i$ or any slower set of machines, and is placed on the queue of $M_{i-1}$.

Now we will show that if the procedure outputs no then there is no schedule of length d.
If condition 2 is true then a machine in $M_1$ ran a job for more than d units of time; therefore
this job could not have been processed in d units of time on any of the machines, since no other
machine runs at a faster speed. If condition 3 is true, then up until time d all machines in the
sets $M_1, \ldots, M_k$ must have been busy processing jobs that could not have been processed in d
units of time on machines in $M_{k+1}, \ldots, M_{\log R}$. Therefore, machines in $M_1, \ldots, M_k$ could not
have processed all of these jobs and job j as well by time d.

We have given an on-line 2 log R-relaxed decision procedure; we now show how to use it to
develop an on-line $O(\log R)$-approximation algorithm.

The on-line algorithm initially establishes a lower bound A on $C^*_{\max}$ by running an arbitrarily
chosen job on the fastest processor. Let A be the time taken to complete this job; certainly
$A \le C^*_{\max}$. Next, the on-line algorithm calls the procedure on the set of all jobs with d = A. If
the procedure returns no, then we will call it again with d = 2A and the set of jobs that were
not completely processed in the first iteration. In general, if the ith iteration fails to produce a
schedule, then we will call the procedure again for the (i + 1)st time with $d = 2^{i}A$ and all jobs
that have not yet been completely processed. Observe that if the ith iteration fails to produce
a schedule when called with $d = 2^{i-1}A$, then it proves that $2^{i-1}A < C^*_{\max}$. Suppose that we
finally finish processing all jobs in iteration $l$. Then the total length of the schedule produced
is

$$2 \cdot \left(A + (1 + 2 + \cdots + 2^{l-1})(2A \log R)\right) \le 2^{l+2} A \log R.$$

Since the procedure failed to produce a schedule in iteration $l - 1$, we know that $2^{l-2}A < C^*_{\max}$.
Therefore the total length of the schedule produced is no greater than $16(\log R)C^*_{\max}$. ■
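The arithmetic behind the doubling bound can be checked numerically. The following sketch only evaluates the left-hand side of the inequality above; the function name `doubling_schedule_length` is hypothetical, logarithms are taken base 2, and R is assumed to be at least 2 so that the initial term A is absorbed.

```python
import math


def doubling_schedule_length(A, R, l):
    """Length bound when iterations 1..l run with d = A, 2A, ..., 2**(l-1) * A.

    Each iteration costs at most 2 * d * log2(R); the initial probe job costs A;
    the leading factor 2 pays for rounding speeds down to powers of two.
    """
    per_iteration = sum((2 ** i) * 2 * A * math.log2(R) for i in range(l))
    return 2 * (A + per_iteration)
```

With A = 1, R = 4 and l = 3 this evaluates to 58, which is below $2^{l+2} A \log R = 64$.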

If we consider only the k fastest machines, where k is defined as the smallest k
such that $\sum_{i=1}^{k} s_i \ge \frac{1}{2}\sum_{i=1}^{m} s_i$, then $R = s_1/s_k < m$. We now show that we can assume that
the above algorithm uses only the k fastest machines. In time $2C^*_{\max}$ we can process on-line all
but k of the jobs by processing jobs arbitrarily on machines $m_1, \ldots, m_k$ until the first moment
in time at which at most k jobs have not yet been completely processed. The amount of time it
takes until this point is bounded above by $(\sum_{j=1}^{n} p_j)/(\frac{1}{2}\sum_{i=1}^{m} s_i) \le 2C^*_{\max}$, since none of the k
machines is idle. We will only need machines $m_1, \ldots, m_k$ to process these last k jobs. If we then
produce a schedule of length $l$ for the last k jobs on these machines, the entire schedule will
be of length $2C^*_{\max} + l$. Therefore, without loss of generality, we can assume that the machine
speeds satisfy R < m.
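The choice of k in this reduction can be illustrated with a short sketch. The function name `fastest_prefix` is hypothetical, and speeds are assumed to be positive and sorted from fastest to slowest.

```python
def fastest_prefix(speeds):
    """Smallest k such that the k fastest machines hold at least half the total speed.

    speeds must be sorted from fastest to slowest; returns (k, s_1 / s_k).
    """
    half = sum(speeds) / 2
    running = 0
    for k, s in enumerate(speeds, start=1):
        running += s
        if running >= half:
            return k, speeds[0] / s
    raise ValueError("speeds must be non-empty and positive")
```

For speeds `[4, 3, 2, 1]` the two fastest machines already account for 7 of the 10 units of speed, so k = 2 and the speed ratio among the retained machines is 4/3.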

Corollary 3.3.10 There is an on-line algorithm for scheduling non-preemptive uniformly related
machines that has competitive ratio $\min(16 \log R, 16 \log m + 2)$.

To bound the number of restarts of this on-line algorithm, observe that in any iteration
of the on-line relaxed decision procedure at most $O(m)$ jobs will be restarted $O(\log R)$
times. The on-line relaxed decision procedure is run at most $O(\log(C^*_{\max} s_1/p_{\min}))$ times, since
the initial candidate deadline is at least $p_{\min}/s_1$ and we successively double the deadline until
we reach a feasible deadline, which $C^*_{\max}$ certainly is. Therefore this algorithm performs
$O(m \log R \log(C^*_{\max} s_1/p_{\min}))$ restarts.

3.4  Lower  Bounds 

3.4.1  Identical  Machines 

As with other on-line algorithms, on-line scheduling can be viewed as a game against an adversary
who is allowed to determine the information that is revealed incrementally to the algorithm.
Therefore, our lower bound arguments will often be phrased in terms of a strategy for an adversary,
who attempts to reveal information in such a way as to force the competitive ratio to
be as large as possible. We will consider two possible types of adversaries. The stronger is an
adaptive adversary, who knows in advance the scheduling algorithm and also knows in advance
the result of any coin tosses of the algorithm. The weaker is an oblivious adversary, who knows
only the algorithm but not the coin tosses [100, 7]. Note that for deterministic algorithms this
distinction is clearly irrelevant.

The  oblivious  adversary  models  the  situation  where  a  randomized  algorithm  receives  a 
problem  instance  and  must  produce  a  solution;  the  random  choices  it  makes  have  no  effect 
on  the  input  it  sees.  On  the  other  hand,  the  adaptive  adversary  models  the  situation  where 
there  is  some  feedback  between  the  choices  the  algorithm  makes  and  future  input  it  sees.  A 
good  example  of  this  latter  situation  is  paging:  depending  on  what  random  choices  a  paging 
algorithm  makes  it  may  or  may  not  itself  cause  page  faults,  in  addition  to  those  caused  by  other 
elements  of  the  operating  system. 

We  begin  our  lower  bounds  with  a  lower  bound  on  the  competitive  ratio  of  any  on-line 
algorithm  for  scheduling  identical  machines. 

Theorem 3.4.1 The competitive ratio of any deterministic on-line algorithm for scheduling identical
machines, with no preemption allowed, is at least $(2 - \frac{1}{m})$.

Proof: For any m, let n = m(m - 1) + 1. Each of the first m(m - 1) jobs is of size 1, while the
last job is of size m; that is, $p_1 = \cdots = p_{n-1} = 1$, $p_n = m$. This instance is due to Graham [52].
The optimal schedule is of length m, and consists of scheduling the last job on a machine by
itself, and scheduling m of the single unit jobs on each of the remaining m - 1 machines. The
length of a schedule for this instance is determined by the starting time of the job of size m;
therefore the adversary wishes to make it start as late as possible. Each of the first n - 1 jobs
that the scheduler allows to run for at least one unit of time will be fixed by the adversary to
be jobs of size 1. Given this strategy of the adversary, it is not difficult to see that by time i,
$1 \le i \le m - 1$, at most im jobs are either completely processed or currently being processed.
Hence by time m - 1 there must be one job that has not been completely processed and is not
currently being processed. The adversary sets this job to be of size m. If this job starts at time
m - 1, the fastest the schedule can complete is by time 2m - 1, which is $(2 - \frac{1}{m})$ times as long
as the optimal schedule. ■
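The outcome the adversary forces can be replayed with a few lines of code. The sketch below does not model the adversary itself; it simply list-schedules Graham's instance with the large job placed last, which is the order the adversary's strategy produces. The function names are hypothetical.

```python
import heapq


def list_schedule_length(sizes, m):
    """Makespan of list scheduling: each job goes to the machine that frees up first."""
    free_at = [0] * m
    heapq.heapify(free_at)
    makespan = 0
    for p in sizes:
        start = heapq.heappop(free_at)
        makespan = max(makespan, start + p)
        heapq.heappush(free_at, start + p)
    return makespan


def graham_instance(m):
    """m(m-1) unit jobs followed by one job of size m."""
    return [1] * (m * (m - 1)) + [m]
```

For m = 4 the unit jobs occupy all machines until time 3, the large job then runs from time 3 to time 7, and the optimal length is 4, giving the ratio 7/4 = 2 - 1/4.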

In contrast to the nonpreemptive model, an optimal preemptive schedule can be found off-line
in polynomial time [89]. Interestingly enough, an argument similar to the previous proof
shows that the on-line worst-case characterization of both models is the same.

Theorem 3.4.2 Any deterministic on-line algorithm for scheduling identical machines with preemption
allowed has competitive ratio at least $(2 - \frac{1}{m})$.

Proof: Consider an instance with n = m + 1 jobs. The adversary allows the scheduler to begin
scheduling, and waits until either the scheduler preempts a job for the first time, or 1 time unit
has passed, whichever comes first. Call this time t. By time t, at most m jobs can have been
started (since the scheduler did not preempt anything until time t). Let job n be a job that was
not started. At time t, the adversary sets $p_1 = \cdots = p_{n-1} = t$, and sets $p_n = tm/(m - 1)$.
The scheduler can clearly complete the entire schedule no sooner than time $t + tm/(m - 1)$.
The length of the optimal preemptive schedule is known to be $\max(p_{\max}, \sum_j p_j/m)$. In this case
$\max(p_{\max}, \sum_j p_j/m) = tm/(m - 1)$. Therefore the adversary forces a competitive ratio of
$[t + tm/(m - 1)]/[tm/(m - 1)] = (2 - \frac{1}{m})$. ■
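The ratio in this proof can be verified with exact rational arithmetic. This is a sketch of the calculation only; the function name `preemptive_lower_bound_ratio` is hypothetical and m is assumed to be at least 2.

```python
from fractions import Fraction


def preemptive_lower_bound_ratio(m, t=Fraction(1)):
    """Ratio forced by the instance p_1 = ... = p_m = t, p_{m+1} = t*m/(m-1)."""
    t = Fraction(t)
    big = t * m / (m - 1)
    sizes = [t] * m + [big]
    optimal = max(max(sizes), sum(sizes) / m)   # optimal preemptive length
    online = t + big                            # big job cannot start before t
    return online / optimal
```

For m = 3 and t = 1 the large job has size 3/2, the optimal preemptive length is 3/2, and the on-line schedule has length at least 5/2, so the ratio is 5/3.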

The essence of these deterministic lower bounds is that there is one large job whose starting
time determines the length of the schedule, and the adversary can force the scheduler to start
that job late in the schedule. When we move to randomized algorithms it is true that an adaptive
adversary can force the randomized scheduler to do as poorly as the deterministic scheduler.
One might imagine, however, that a randomized algorithm A working against an oblivious
adversary might, with significant probability, select and schedule the large jobs earlier, thus
doing better. (It is known, for example, that an algorithm that schedules jobs in nonincreasing
size order produces a schedule of length no longer than $(\frac{4}{3} - \frac{1}{3m})C^*_{\max}$ [53].) A randomized algorithm
can clearly gain something over a deterministic algorithm: consider, for example, the algorithm
that randomly chooses an ordering of the jobs and then list schedules according to that ordering.
Since each list schedule is at most $(2 - \frac{1}{m})$ times optimal in length, and at least one of the list
schedules will be the optimal schedule, this randomized algorithm has expected performance
strictly less than $(2 - \frac{1}{m})$. We will prove, however, that randomness is ultimately of little help
to a non-preemptive on-line scheduler of identical machines.

Theorem 3.4.3 Any randomized on-line algorithm for non-preemptive scheduling of identical
machines, working against an oblivious adversary, has worst case expected value of at least
$(2 - O(\frac{1}{\sqrt{m}}))C^*_{\max}$.

Our strategy to prove this theorem is as follows. We will first define the notion of a reasonable
randomized algorithm for scheduling identical machines. We will then show that for any c-competitive
unreasonable algorithm, there exists a reasonable algorithm that has a competitive
ratio no greater than c and that always chooses the next job to schedule uniformly. Finally,
we will provide an instance for which the competitive ratio of such a strategy has worst case
expected value $(2 - O(\frac{1}{\sqrt{m}}))C^*_{\max}$.

Definition  3.4.4  A  reasonable  randomized  algorithm  for  scheduling  identical  machines  is  an 
algorithm  that  does  not  restart  any  job  and  does  not  leave  any  machine  idle  so  long  as  there  is  some 
job  that  has  not  yet  been  started. 

Lemma 3.4.5 For any unreasonable algorithm A there is a reasonable algorithm A' whose worst-case
expected performance is at least as good as that of A.

Proof: First we argue that the introduction of idle time into a schedule cannot help the
scheduler. Assume that job j is to be started at time $t_2$ on machine i which is idle from time
$t_1$ to $t_2$. Now if job j is available at time $t_1$, it is clearly to the advantage of the scheduler to
start job j on machine i at time $t_1$. If job j is not available at time $t_1$ then it is running on
another machine i'. In this case there is no point in restarting job j on machine i; since the two
machines are of identical speed we can switch the future schedules of the two machines without
increasing the total length of the schedule. Now restarting a job j after it has run for $t < p_j$
units of time is equivalent, in terms of the effect on the length of the schedule, to introducing
t units of idle time, and thus does not help the scheduler either.

Lemma 3.4.6 A reasonable randomized algorithm A is equivalent to an algorithm that, whenever
a machine becomes idle, picks one of the unstarted jobs with a certain probability distribution which
may depend on the schedule constructed up to that point.

Proof: Since a reasonable randomized algorithm constructs a schedule with no restarts and no
idle time, it must be the case that it schedules some unstarted job whenever a machine becomes
idle. The probability distribution for its next choice cannot depend on information that the
algorithm does not have at that point; thus, it can depend only on the schedule constructed
until that particular choice of a job. ■

We  will  now  argue  that  the  adversary  can  always  force  the  scheduler  to  do  as  poorly  as  it 
would  have  done  had  it  always  made  its  choices  according  to  the  uniform  distribution. 

Lemma  3.4.7  The  competitive  ratio  of  a  reasonable  randomized  algorithm  A  can  be  no  less  than 
that  of  the  reasonable  algorithm  U  that  always  picks  the  next  job  to  process  uniformly  from  among 
the  remaining  jobs. 

Proof: We note that the adversary’s strategy can be described as choosing the sizes of the
jobs and then choosing some permutation of the jobs. If the adversary chooses the permutation
randomly and uniformly, then the probability of the algorithm A selecting any particular job
is uniform over all remaining jobs, no matter what probability distribution A uses. Let E be
the expected performance of algorithm A against this adversary, where the expectation is taken
over the random choices of both A and the adversary. Note that the adversary can always
choose some permutation of jobs such that the expected performance of A, taken over just the
choices of A, is no better than E. Since the expected performance of the algorithm U that
chooses uniformly is E no matter which permutation is used, algorithm A can have competitive
ratio no better than algorithm U. ■

We  complete  the  proof  of  theorem  3.4.3  by  showing  that  scheduling  by  choosing  the  next 
job  uniformly  can  do  quite  poorly. 

Lemma 3.4.8 There is a problem instance for scheduling identical machines on which a uniform
choice of the next job to process produces a schedule with expected length $(2 - O(\frac{1}{\sqrt{m}}))C^*_{\max}$.

Proof: We will consider the problem instance with k jobs of size m (“big” jobs) and m(m - k)
jobs of size 1 (“small” jobs). The optimum length schedule for this instance is of length m.
The expected length of the schedule is then $m + E_s$, where $E_s$ is the expected start time of
the last big job in the schedule. To bound $E_s$, we can think of the problem as a “shell game”,
where there are n = m(m - k) + k shells, under k of which there are peas. If one searches
for the peas by choosing among the remaining shells randomly and uniformly, the expected
place of the last pea to be found is $\frac{k}{k+1}(n)$. Therefore we expect the kth largest job to be the
$\frac{k}{k+1}[m(m - k) + k]$th chosen overall. This will happen no earlier than time $\frac{k}{k+1}(m - k)$, since
at most m jobs are completed during every unit of time. Therefore $E_s \ge \frac{k}{k+1}(m - k) = F(k)$.
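The shell-game expectation can be checked by brute force on small cases. The sketch below averages over all equally likely placements of the k peas, which is equivalent to searching the shells in uniformly random order; the function name `expected_last_pea_position` is hypothetical. Under this model the exact value is k(n+1)/(k+1), which is at least the kn/(k+1) used in the proof, so the lower bound on $E_s$ is unaffected.

```python
from fractions import Fraction
from itertools import combinations


def expected_last_pea_position(n, k):
    """Exact expected position (1-based) of the last of k peas under n shells,
    averaging over all equally likely placements of the peas."""
    placements = list(combinations(range(1, n + 1), k))
    return Fraction(sum(max(c) for c in placements), len(placements))
```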

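To derive the strongest lower bound possible we maximize F(k) by setting the first derivative
to 0.

$$F'(k) = \frac{(m - 2k)(k + 1) - k(m - k)}{(k + 1)^2},$$

so we wish to solve

$$(m - 2k)(k + 1) - k(m - k) = -k^2 - 2k + m = 0.$$

This implies that $k = \sqrt{m + 1} - 1$; plugging into F(k) we see that

$$\begin{aligned}
F(k) &= \frac{\sqrt{m+1} - 1}{\sqrt{m+1}} \cdot \left(m - \sqrt{m+1} + 1\right) \\
&= \frac{m\sqrt{m+1} - (m + 1) + \sqrt{m+1} - m + \sqrt{m+1} - 1}{\sqrt{m+1}} \\
&= \frac{m\sqrt{m+1} + 2\sqrt{m+1} - 2m - 2}{\sqrt{m+1}} \\
&= m\left(1 - \frac{2}{\sqrt{m+1}} + \frac{2}{m} - \frac{2}{m\sqrt{m+1}}\right).
\end{aligned}$$

Thus $m + E_s \ge m + F(k) = m(2 - \frac{2}{\sqrt{m+1}} + o(\frac{1}{\sqrt{m}})) = (2 - \frac{2}{\sqrt{m+1}} + o(\frac{1}{\sqrt{m}}))C^*_{\max}$, which implies the
stated result. ■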
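The maximization can be sanity-checked numerically. The sketch below evaluates F(k) = k(m - k)/(k + 1) directly and searches over integer k; the function names are hypothetical. For m = 99 the closed form gives $k = \sqrt{100} - 1 = 9$.

```python
import math


def F(k, m):
    """Lower bound k(m - k)/(k + 1) on the expected start time of the last big job."""
    return k * (m - k) / (k + 1)


def best_k(m):
    """Integer k in [1, m] maximizing F, found by direct search."""
    return max(range(1, m + 1), key=lambda k: F(k, m))
```

With m = 99 the search returns k = 9 and F(9) = 81, so the expected schedule length is at least 180 against an optimum of 99, a ratio of about 1.82, slightly above $2 - 2/\sqrt{m+1} = 1.8$.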

3.4.2  Uniformly  Related  Machines 

In the case of uniformly related machines the situation becomes significantly more difficult
for the scheduler. We will show that the adversary can force any deterministic scheduler to
construct a schedule of length $\Omega(\log m)$ times the length of the optimal schedule, whether or
not the scheduler is allowed to preempt jobs.

Theorem 3.4.9 The competitive ratio of any deterministic on-line scheduling algorithm for uniformly
related machines, whether or not preemption is allowed, is $\Omega(\log m)$.

Our  proof  will  be  in  terms  of  a  preemptive  scheduler,  but  it  is  not  hard  to  see  that  a  similar 
argument  will  work  when  no  preemption  is  allowed. 

To prove this theorem we use a family of instances $I_k$ for uniformly related machines given
by Cho and Sahni [22] in a somewhat different context. Let $k = (\log_2(3m - 1) + 1)/2$. We
restrict ourselves to values of m such that k is integral. The instance $I_k$ has k sets of machines
$G_i$ and k sets of jobs $T_i$, $1 \le i \le k$. Each machine in $G_i$ has a speed of $2^i$ and each job in $T_i$
has size $2^i$. Finally, $|G_i| = |T_i| = 2^{2k-2i-1}$ for $1 \le i < k$, and $|G_k| = |T_k| = 1$. It is easy to see
that $C^*_{\max} = 1$, since each job of size $p_j$ can be scheduled by itself on a machine of speed $p_j$.
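The relation between k and m in this family can be checked directly from the group sizes. The sketch below assumes the sizes as reconstructed above; the function names `instance_sizes` and `machine_count` are hypothetical.

```python
def instance_sizes(k):
    """Group sizes |G_i| = |T_i| for i = 1..k: 2**(2k - 2i - 1) for i < k, and 1 for i = k."""
    return [2 ** (2 * k - 2 * i - 1) for i in range(1, k)] + [1]


def machine_count(k):
    """Total number of machines m in the instance I_k."""
    return sum(instance_sizes(k))
```

For k = 3 the groups have sizes 8, 2 and 1, so m = 11 and $(\log_2(3 \cdot 11 - 1) + 1)/2 = 3$, as required.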

Let $X_i$ be the time when, in a schedule for $I_k$, the last job in $T_i$ finishes.

Lemma 3.4.10 The adversary can always force the scheduler to construct a schedule for $I_k$ in
which $X_1 \le X_2 \le \cdots \le X_k$.

Proof: Assume that the adversary is competing against a scheduler who somehow knows
the job sizes in advance, but doesn’t know which size belongs to which job. Certainly if the
adversary can force this type of scheduler to do badly, the adversary can force a scheduler with
no knowledge of job sizes to do badly. We introduce the idea of the adversary committing to a
set of jobs. At time t, let J(t) be the set of jobs that have not yet completed. The scheduler has
a corresponding set L(t) of the sizes of the jobs which have not yet completed. The adversary
is not committed to any job in J(t) if, given the amount of time that the jobs in J(t) have been
running, the scheduler cannot infer any information about which job in J(t) is associated with
which size in L(t). More formally, the adversary is not committed to any job in J(t) if, at time
t, any bijective mapping from J(t) to L(t) is valid given the schedule thus far. Let R(i, t) be
the total amount of processing that has been done on the job that is running on machine i at
time t. If the adversary is not committed to any job in J(t) at time t then

$$R(i, t) < \min_{j \in J(t)} p_j, \quad 1 \le i \le m.$$

The adversary’s strategy is to avoid being committed to any job in J(t). The adversary can
do this if, at any point in time t' such that $R(i, t') = \min_{j \in J(t')} p_j$ for some i, the adversary
allows the smallest job in J(t') to complete on machine i. If the equality holds true for more
than one machine i or more than one job j, then the smallest indexed job j completes on the
smallest indexed machine i and so forth. The adversary continues to complete jobs until the
inequality $R(i, t') < \min_{j \in J(t')} p_j$ holds again. It is clear that this strategy yields a schedule
satisfying the condition of the lemma. ■

Lemma 3.4.11 The adversary can always force the scheduler to produce a schedule for $I_k$ in
which $X_i - X_{i-1} \ge \frac{1}{4}$, $1 \le i \le k$, $X_0 = 0$.

Proof: The adversary uses the same strategy as in the previous proof. Consider the status of
the jobs in $T_{i+1}$ at time $X_i$. None of them have been completed; in fact, no more than $2^i$ of
the $2^{i+1}$ units of each job have been processed. This is because until all the jobs in $T_i$ were
completed (at time $X_i$), any job that had $2^i$ units processed was designated by the adversary
to be in $T_i$ and was thus finished. Therefore there are $|T_{i+1}|$ jobs, each with remaining work of
at least $2^i$ units. How quickly might these all complete?

Since there are more than $|T_{i+1}|$ machines in the sets $G_{i+1}, G_{i+2}, \ldots, G_k$, in an optimal
schedule there is no need to run one of the jobs in $T_{i+1}$ on a machine in $G_i$ or slower. At
best, processing all the jobs from $T_{i+1}$ on all the machines in $G_{i+1}$ and faster must take time
at least the sum of the remaining processing requirements of the jobs in $T_{i+1}$ over the sum of
the processing speeds of processors in $G_{i+1}$ or faster.

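So the time taken is at least

$$\begin{aligned}
\frac{\sum_{j \in T_{i+1}} p_j/2}{\sum_{l \in \bigcup_{r=i+1}^{k} G_r} s_l}
&= \frac{2^{i}\,|T_{i+1}|}{\sum_{r=i+1}^{k} 2^{r}\,|G_r|} \\
&= \frac{2^{2k-i-3}}{\sum_{r=i+1}^{k-1} 2^{r} \cdot 2^{2k-2r-1} + 2^{k}} \\
&= \frac{2^{2k-i-3}}{2^{k}\sum_{r=0}^{k-i-2} 2^{r} + 2^{k}} \\
&= \frac{2^{2k-i-3}}{2^{k}(2^{k-i-1} - 1) + 2^{k}} \\
&= \frac{2^{2k-i-3}}{2^{2k-i-1}} \\
&= \frac{1}{4}.
\end{aligned}$$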


The $\Omega(\log m)$ lower bound follows directly from these lemmas.
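The ratio of remaining work to available speed can be evaluated exactly for small k. The sketch below uses the group sizes of $I_k$ as stated in the text; the function name `gap_lower_bound` is hypothetical. With $|T_k| = 1$ the ratio for the last group evaluates to 1/2 rather than 1/4, which is still at least 1/4.

```python
from fractions import Fraction


def gap_lower_bound(k, i):
    """Remaining work of T_{i+1} at time X_i over the total speed of G_{i+1}, ..., G_k."""
    def group(r):
        # |G_r| = |T_r| in the instance I_k
        return 1 if r == k else 2 ** (2 * k - 2 * r - 1)

    remaining_work = (2 ** i) * group(i + 1)
    total_speed = sum((2 ** r) * group(r) for r in range(i + 1, k + 1))
    return Fraction(remaining_work, total_speed)
```

Summing these gaps over the k groups gives $X_k \ge k/4$, while $C^*_{\max} = 1$ and k grows as $\log m$.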

Theorem 3.4.12 The competitive ratio for any deterministic on-line algorithm for scheduling
uniformly related machines with $R = s_1/s_m < m$ is $\Omega(\log R)$, whether or not preemption is allowed.

Proof: We modify our previous proof slightly. Let R' be the largest power of 2 less than or
equal to R. We again let $k = (\log(3m - 1) + 1)/2$, and restrict ourselves to values of m such
that k is integral. The instance $I_k$ has k sets of machines $G_i$ and k sets of jobs $T_i$, $1 \le i \le k$.
$|G_i| = |T_i| = 2^{2k-2i-1}$ for $1 \le i < k$, and $|G_k| = |T_k| = 1$. Once again each job in $T_i$ has
size $2^i$; the only difference is that each machine in $G_i$ has a speed of $2^i$ for $i > k - \log R'$; the
machines $G_i$, $i \le k - \log R'$, all have the same speeds as the machines in $G_{k-\log R'}$, namely
$2^k/R'$. Thus $s_1/s_m$ for this instance is R'. Again, the optimal schedule length is 1. By using
the same strategy as in the previous proof, the adversary can force the scheduler to construct a
schedule for $I_k$ in which $X_1 \le X_2 \le \cdots \le X_k$, and in which $X_i - X_{i-1} \ge \frac{1}{4}$, $k - \log R' < i \le k$,
$X_0 = 0$. All of the “small” jobs may finish quite quickly, but those in $T_{k-\log R'}, \ldots, T_k$ will each
take at least $\frac{1}{4}$ unit of time to complete. ■

3.5  On-line  Algorithms  for  Other  Scheduling  Models 

3.5.1  Shop  Scheduling 

The  juxtaposition  of  the  results  in  this  and  the  last  chapter  raises  the  question  of  how  well  can 
one  schedule  shop  problems  on-line;  specifically,  how  well  can  one  schedule  if  the  processing 
time  of  each  operation  of  a  job  is  unknown  until  the  completion  of  that  operation,  but  the 
number  of  operations  of  each  job  and  the  order  constraints  on  their  execution  are  known  in 
advance? 

The most naive on-line strategy for shop problems is to arbitrarily choose jobs to process,
subject to the condition that no machine sits idle when there is some operation of a job which
it could be processing. We have already noted that for the flow shop and job shop problems
such a strategy yields an m-approximation algorithm, and furthermore that this bound is tight.
For the open shop problem we have proved that such a strategy is a 2-approximation algorithm,
and have shown that it can do as badly as $(2 - \frac{1}{m})$. Our belief is that these bounds are close to
the best one can do on-line; we give one small theorem in that direction.

Theorem 3.5.1 No deterministic on-line algorithm for open shop scheduling has a competitive
ratio better than $\frac{3}{2}$.

Proof: First consider a simple example with two machines and three jobs, and let A be a
deterministic on-line open shop scheduling algorithm. Each job has an operation on both
machines; the adversary will determine the sizes of these operations based on the strategy of A.

Assume  A  begins  by  starting  two  jobs  on  machines  1  and  2,  without  loss  of  generality  job 
1  on  machine  1  and  job  2  on  machine  2.  Let  t  be  the  first  point  in  time  at  which  A  interrupts 
a  job;  if  A  never  does  let  t  =  1.  We  define  job  1  to  have  an  operation  of  size  t  on  machine  1 
and  an  operation  of  size  0  on  machine  2;  similarly  job  2  is  defined  to  have  an  operation  of  size 
t  on  machine  2  and  an  operation  of  size  0  on  machine  1.  Job  3  is  defined  to  have  an  operation 
of size t on each machine. The optimal schedule length is 2t, but A will not start job 3 until
time t and will produce a schedule of length 3t.

If A does not start two jobs at time 0, let t be the point at which it starts a job on the second
machine. Let job 1 have an operation of size t on machine 1; let job 2 have an operation of size
t on machine 2; let all other operations be of size 0. Then A produces a schedule of length at
least 2t when the optimal length is t.

To extend this example to any even number m of machines, just duplicate the 2-machine
example m/2 times. ■
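
The numbers in the first case of this proof can be checked with a short Python sketch. It relies on a standard fact about two-machine open shops that we assume here rather than prove: the optimal schedule length equals the largest of the two machine loads and the longest single job. The helper names are ours.

```python
from fractions import Fraction


def two_machine_open_shop_opt(jobs):
    """Optimal makespan for a two-machine open shop (assumed closed form).

    jobs is a list of (size on machine 1, size on machine 2) pairs.
    """
    load1 = sum(a for a, _ in jobs)
    load2 = sum(b for _, b in jobs)
    longest_job = max(a + b for a, b in jobs)
    return max(load1, load2, longest_job)


def adversary_instance(t):
    """Jobs 1, 2 and 3 as fixed by the adversary once A first interrupts at time t."""
    return [(t, 0), (0, t), (t, t)]


def online_length(t):
    """Job 3 cannot start before time t, and its two operations cannot overlap."""
    return t + 2 * t


t = 2
optimal = two_machine_open_shop_opt(adversary_instance(t))
ratio = Fraction(online_length(t), optimal)
```

With t = 2 the optimal length is 4 and the on-line schedule has length 6, so `ratio` is 3/2.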

We  can  also  give  the  same  lower  bound  in  the  on-line  scheduling  scenario  where  jobs  have 
unknown  release  dates  but  once  they  arrive  they  are  fully  specified. 

Theorem 3.5.2 No deterministic algorithm for open shop scheduling with unknown release dates
but fully specified jobs (on arrival) has competitive ratio better than $\frac{3}{2}$.

Proof: We again give a simple two machine example that can be duplicated m/2 times. At time
0 there is just one job released, which has an operation of size 1 on each of the two machines.
At time 1 release a job which has one operation of size 1 on whichever machine the scheduling
algorithm left idle from time 0 to 1. The optimal length of a schedule for this instance is 2, but
the algorithm will produce a schedule of length at least 3. ■

3.5.2  Precedence  Constraints 

As  we  discussed  in  section  1.3,  an  important  element  in  a  model  of  parallel  processor  scheduling 
is  precedence  constraints  between  the  jobs,  that  form  a  partial  order  (dag)  reflecting  the  logical 
flow  of  information  in  the  program.  We  refer  to  the  levels  of  the  dag,  where  level  0  contains 
jobs  that  are  not  preceded  by  any  other  jobs.  A  reasonable  on-line  model  of  this  element  is 
that  when  a  job  arrives,  despite  the  fact  that  its  size  is  unknown,  the  precedence  constraints 
between  it  and  any  other  previously  known-about  job  are  known  as  well.  Further,  we  must 
assume that when a job in level i of the dag arrives, the jobs in levels 0, ..., i - 1 of the dag
have  already  arrived. 

Graham proved in 1966 that list scheduling is a $(2 - \frac{1}{m})$-approximation algorithm for
scheduling identical machines even with precedence constraints. Since we have proved a lower bound
of $(2 - \frac{1}{m})$ without precedence constraints, $(2 - \frac{1}{m})$ is a tight bound. For uniformly related
machines, the current situation is much bleaker, since the best off-line approximation algorithm
for uniformly related machines with precedence constraints is only an $O(\sqrt{m})$-approximation
algorithm; this algorithm, however, happens to be an on-line algorithm as well.
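
To make Graham's bound for list scheduling concrete, the following Python sketch implements list scheduling on identical machines without precedence constraints and runs it on a classical worst-case list: m(m - 1) unit jobs followed by one job of size m. The function names and the choice of instance are ours, added for illustration.

```python
import heapq


def list_schedule(jobs, m):
    """Assign each job, in list order, to the currently least loaded machine.

    Returns the resulting schedule length (makespan).
    """
    loads = [0] * m
    heapq.heapify(loads)
    for p in jobs:
        least = heapq.heappop(loads)
        heapq.heappush(loads, least + p)
    return max(loads)


def tight_instance(m):
    """m(m - 1) unit jobs followed by a single job of size m."""
    return [1] * (m * (m - 1)) + [m]


m = 4
online = list_schedule(tight_instance(m), m)  # unit jobs fill every machine to m - 1
optimal = m  # the long job alone on one machine, unit jobs spread over the rest
```

On this list the schedule has length 2m - 1 while the optimum is m, so the ratio is exactly 2 - 1/m.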

3.6  Conclusions  and  Open  Problems 

The most obvious open problem raised by our work is to close the gap between the upper
bound of $O(\log n)$ and the lower bound of $\Omega(\log m)$ for unrelated machines. All that would be
necessary to do this would be a “preprocessing algorithm” that reduced the number of jobs to
a number polynomial in m. For uniformly related machines, we showed that list scheduling
of the first n - m jobs accomplishes this goal. It is not clear, however, that a similar naive
approach will be of use for unrelated machines.

Other issues we have not fully resolved are the usefulness of randomization in uniformly
related and unrelated machines, and an exact characterization of the power of the on-line shop
scheduler. Another direction would be average case analyses of on-line algorithms for these
problems. With regard to scheduling of parallel identical machines, Bruno and Downey [18]
proved that when the $p_j$ are independently and uniformly distributed over the interval [0, 1],

lim)Prob[maxpj7^Pj  >  =  0. 

This  implies  that  when  n  grows  faster  than  m,  list  schedules  are  asymptotically  optimal  with 
probability  that  goes  to  1.  Coffman  and  Gilbert  refined  and  extended  these  results  to  other 
distributions [35]. However, in contrast to the relative error, the absolute error does not tend
to 0 as $n \to \infty$, with m fixed. This stronger criterion of optimality is satisfied by the (off-line)
algorithm that schedules jobs in order of longest processing time. It would be interesting
to  prove  that  no  on-line  algorithm  can  do  better  than  list  scheduling  in  this  regard;  it  would 
also  be  interesting  to  carry  out  similar  analyses  of  on-line  algorithms  for  uniformly  related  and 
unrelated  machines. 

We  mentioned  earlier,  in  the  context  of  paging  and  list  maintenance,  that  the  conclusions 
drawn from average-case analysis of on-line algorithms do not always correspond to the
conclusions of experimental studies and practical experience. If this proves to be the case for
parallel  machine  scheduling,  then  the  design  and  analysis  of  a  model  that  is  less  pessimistic 
than  the  worst-case  competitive  ratio  but  has  more  structure  than  expected  performance  on 
randomly-selected  instances  might  be  a  valuable  endeavor. 


Chapter  4 


Parallel  Network  Optimization:  An 
Introduction 


In  the  last  decade  there  has  been  much  interest  in  harnessing  the  power  of  parallel  computers  to 
solve  large  complex  problems  in  real  time.  The  first  step  in  any  such  effort  must  be  to  understand 
how  efficiently  the  most  basic  problems  can  be  solved  by  parallel  computers,  and  to  then 
construct  more  complex  systems  out  of  these  building  blocks.  Accordingly  both  theoreticians 
and  practitioners  have  put  much  effort  into  the  study  of  parallel  algorithms  for  a  large  variety 
of  basic  problems. 

In  the  field  of  combinatorial  optimization  one  can  hardly  find  more  basic  problems  than  the 
matching,  maximum  flow  and  minimum-cost  flow  problems.  Many  of  the  important  concepts 
that  arose  out  of  the  study  of  these  problems  -  augmenting  paths,  scaling,  relaxed  optimality, 
strong  polynomiality  -  have  been  used  widely  in  other  areas  of  combinatorial  optimization. 
Each  has  a  large  number  of  important  applications  in  and  of  itself;  furthermore  these  problems 
serve  as  building  blocks  for  a  wide  variety  of  more  complex  applications,  such  as  the  traveling 
salesman  problem  [80],  cyclic  staff  scheduling,  machine  scheduling  [83]  and  vehicle  and  crew 
routing  [16].  We  therefore  expect  that  the  study  of  parallel  algorithms  for  these  problems  will 
yield  important  insights  into  the  theory  of  parallel  computation  while  also  providing  useful 
building  blocks  for  the  parallel  solution  of  harder  problems. 

In the next three chapters of this thesis we consider several practical and theoretical issues
about parallel algorithms for these problems. In this chapter we introduce this topic by formally
defining  the  problems  and  our  theoretical  model  of  parallel  computation,  and  by  discussing 
previous  theoretical  work  on  parallel  algorithms  for  these  problems. 

4.1  Definitions  and  Models 

In the matching problem we are given an undirected graph G = (V, E), possibly with an
associated weight function $w : E \to \mathbb{Z}^+ \cup \{0\}$. Let $w_{\max} = \max_{e \in E} w(e)$, $n = |V|$ and $m = |E|$.
Given $S \subseteq E$, define $\deg_S(v)$ to be the degree of v in the graph (V, S). A matching in a
graph G is a set of edges $M \subseteq E$ such that no two edges in M share a common endpoint; i.e.,
$\deg_M(v) \le 1$, $\forall v \in V$. A perfect matching is a matching of cardinality $n/2$. A minimum weight
perfect matching is a perfect matching M that minimizes $\sum_{e \in M} w(e)$. If G is bipartite then
a perfect matching is known as an assignment. In various forms of the problem one is asked
to produce, for example, a maximum cardinality matching, a maximum-weight matching or a
minimum-weight perfect matching.
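
The definitions above translate directly into small checking routines. The following Python sketch is illustrative only: it assumes edges are given as pairs of vertices with no self-loops, that the candidate edge set is already a subset of E, and that weights are stored in a dictionary; all names are ours.

```python
from collections import Counter


def degrees(edge_set):
    """deg_S(v) for every vertex v touched by the edge set S."""
    deg = Counter()
    for u, v in edge_set:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_matching(edge_set):
    """No two edges share an endpoint, i.e. deg_M(v) <= 1 for every v."""
    return all(d <= 1 for d in degrees(edge_set).values())


def is_perfect_matching(vertices, edge_set):
    """A matching of cardinality n/2."""
    return is_matching(edge_set) and 2 * len(edge_set) == len(vertices)


def weight(edge_set, w):
    """Sum of w(e) over the edges of the set."""
    return sum(w[e] for e in edge_set)


# A 4-cycle with two perfect matchings of different weight.
vertices = [0, 1, 2, 3]
w = {(0, 1): 3, (1, 2): 1, (2, 3): 3, (3, 0): 1}
heavy = [(0, 1), (2, 3)]
light = [(1, 2), (3, 0)]
```

Here both `heavy` and `light` are perfect matchings, and `light` is the minimum-weight one.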

In the maximum-flow problem we are given a flow network G = (V, E), which is a directed
graph with two distinguished vertices, s and t, where s is called the source and t the sink. With
every edge e = (i, j) of a flow network is associated a capacity $u(i,j) \ge 0$. A flow is a real
valued function $f : E \to \mathbb{R}^+ \cup \{0\}$ that satisfies the following two constraints:

Capacity Constraint: For all $i, j \in V$, we require $f(i,j) \le u(i,j)$.

Flow conservation: For all $v \in V$, $v \notin \{s, t\}$,

\[
\sum_{i \in V,\ (i,v) \in E} f(i,v) = \sum_{j \in V,\ (v,j) \in E} f(v,j).
\]

The value of a flow is defined as

\[
\sum_{(i,t) \in E} f(i,t);
\]

a maximum flow is simply a flow of maximum value.

The minimum-cost flow problem is the weighted generalization of the maximum-flow problem.
We assign a cost function $c : E \to \mathbb{R}$ to the edges of G; the cost of a flow f is defined as
the sum

\[
\sum_{(i,j) \in E} f(i,j)\, c(i,j).
\]

We can then ask for the “minimum-cost flow” or the “minimum-cost maximum flow.”
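
As a concrete reading of the flow definitions, the following Python sketch checks the two constraints and computes the value and cost of a flow. It assumes flows, capacities and costs are dictionaries keyed by directed edges, and it takes the value to be the total flow entering the sink; the names and the tiny example network are ours.

```python
def is_flow(f, u, vertices, s, t):
    """Check the capacity constraint and flow conservation for f."""
    if any(e not in u for e in f):
        return False  # flow placed on a pair that is not an edge
    if any(not 0 <= f.get(e, 0) <= u[e] for e in u):
        return False  # capacity constraint violated
    for v in vertices:
        if v in (s, t):
            continue
        inflow = sum(x for (i, j), x in f.items() if j == v)
        outflow = sum(x for (i, j), x in f.items() if i == v)
        if inflow != outflow:
            return False  # conservation violated at v
    return True


def flow_value(f, t):
    """Total flow on edges entering the sink t."""
    return sum(x for (i, j), x in f.items() if j == t)


def flow_cost(f, c):
    """Sum over edges of f(i, j) * c(i, j)."""
    return sum(x * c[e] for e, x in f.items())


vertices = ["s", "a", "t"]
u = {("s", "a"): 2, ("a", "t"): 1, ("s", "t"): 1}
c = {("s", "a"): 0, ("a", "t"): 5, ("s", "t"): 1}
f = {("s", "a"): 1, ("a", "t"): 1, ("s", "t"): 1}
```

In this example f is a feasible flow of value 2 and cost 6; it is also a maximum flow, since the two edges entering t have total capacity 2.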

Models  of  Parallel  Computation 

Our theoretical model of parallel computation will be the CRCW PRAM [73]. A PRAM
consists  of  a  number  of  sequential  independent  processors,  each  with  its  own  private  memory, 
communicating  with  one  another  through  a  global  memory.  In  one  unit  of  time,  each  processor 
can  read  one  global  or  local  memory  location,  execute  a  single  RAM  operation,  and  write 
into  one  global  or  local  memory  location.  In  the  CRCW  PRAM  we  allow  concurrent  reads  of 
the  same  memory  location,  and  concurrent  writes  to  the  same  location;  conflicts  in  writes  are 
resolved by a priority mechanism: the write of the processor with the lowest number succeeds.
We  mention  the  details  of  the  PRAM  here  for  the  sake  of  completeness;  when  we  describe  our 
parallel  algorithms  in  chapter  6  we  will  do  so  at  a  fairly  high  level  of  detail. 
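
The priority rule for concurrent writes can be illustrated with a sequential simulation of a single write step. This Python sketch is only a model of the conflict-resolution rule, not of a PRAM; the representation of write requests as (processor number, address, value) triples is our assumption.

```python
def priority_crcw_write(memory, requests):
    """Apply one concurrent-write step under the priority rule.

    requests is an iterable of (processor_number, address, value) triples.
    When several processors write to the same address, the write of the
    processor with the lowest number succeeds.
    """
    winners = {}
    for pid, addr, value in requests:
        if addr not in winners or pid < winners[addr][0]:
            winners[addr] = (pid, value)
    for addr, (_, value) in winners.items():
        memory[addr] = value
    return memory


memory = [0, 0, 0]
requests = [(3, 1, "c"), (1, 1, "a"), (2, 1, "b"), (5, 2, "z")]
priority_crcw_write(memory, requests)
```

After the step, address 1 holds the value written by processor 1, address 2 holds the only value written there, and address 0 is unchanged.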

The complexity classes which correspond to our notion of easy to parallelize are NC and
RNC. NC is the class of decision problems for which there exist algorithms that run in time
$O(\log^k n)$ on a CRCW PRAM with $O(n^c)$ processors, where c and k are constants and n is
the size of the input. RNC is the corresponding class of decision problems with randomized
algorithms that run in time $O(\log^k n)$ on a CRCW PRAM with $O(n^c)$ processors and produce
the  correct  answer  with  probability  at  least  §. 

By a decision problem we refer to a set of instances of a problem, all of which satisfy some
property. For example, the perfect matching decision problem would be the set of all graphs
G that contain a perfect matching; the decision question would be “Does G contain a perfect
matching?” Another example is the maximum-flow problem; the decision problem might be the
set of all flow networks N for which the value of the maximum flow is odd; the decision question
would be “Is the value of the maximum flow in the flow network N odd?” The optimization
versions of these problems would not ask for a True/False answer, but rather a value, such as
the size of the maximum matching or the value of the maximum flow. We will at times blur
the distinction between the two. For example, in Chapter 6 we give randomized algorithms to
actually find minimum (unary) weight matchings. Since these algorithms run in time $O(\log^k n)$
on a CRCW PRAM with $O(n^c)$ processors, where c and k are constants and n is the size of the
input, we will say the problem is in RNC; it is clear how to use the optimization algorithm as
a decision algorithm. For a fuller discussion of the classes NC and RNC, see the survey of Karp
and Ramachandran [73].

The complexity class that corresponds to our notion of difficult to parallelize is the class of
P-Complete decision problems [73]. These are analogous to the NP-Complete problems in their role as
the “hardest” problems in the class P: no NC or RNC algorithm for any P-Complete problem
exists unless P is equal to, respectively, NC or RNC.

4.2  Previous  Work 

The  parallel  computational  complexity  of  the  matching  problem  is  one  of  the  most  interesting 
open  questions  in  the  theory  of  parallel  computation  today.  The  first  major  progress  was  made 
by Karp, Upfal and Wigderson, who gave an RNC algorithm to find a maximum matching with
an $O(\log^3 n)$ worst-case time bound in the PRAM model [70]. Mulmuley, Vazirani and Vazirani
improved this result to an RNC algorithm with an $O(\log^2 n)$ worst-case time bound. These
algorithms  were  Monte  Carlo  randomized  algorithms,  meaning  that  with  high  probability  they 
produced  a  correct  answer,  but  could  not  indicate  when  they  produced  an  incorrect  answer.  A 
desirable  goal  is  a  Las  Vegas  algorithm:  an  algorithm  that  with  high  probability  produces  a 
correct  solution  and  otherwise  outputs  the  word  FAILURE.  In  this  case  one  can  try  again  and 
eventually  arrive  at  the  optimal  solution. 
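
The difference between the two kinds of randomized algorithm can be sketched in a few lines of Python. The sketch assumes that a separate procedure is available to verify a candidate answer; it wraps a Monte Carlo procedure so that each attempt either returns a verified answer or the word FAILURE, and then retries. The toy search problem and all names are ours and are not taken from any of the cited algorithms.

```python
import random


def las_vegas_attempt(monte_carlo, verify, rng):
    """One attempt: a verified answer, or the word FAILURE."""
    candidate = monte_carlo(rng)
    return candidate if verify(candidate) else "FAILURE"


def run_until_success(monte_carlo, verify, rng, max_tries=1000):
    """Repeat attempts until one succeeds; returns (answer, number of tries)."""
    for tries in range(1, max_tries + 1):
        result = las_vegas_attempt(monte_carlo, verify, rng)
        if result != "FAILURE":
            return result, tries
    raise RuntimeError("no verified answer within max_tries attempts")


# Toy problem: find the position of the single 1 in a list by random guessing.
data = [0, 0, 1, 0]


def guess(rng):
    return rng.randrange(len(data))


def check(position):
    return data[position] == 1


answer, tries = run_until_success(guess, check, random.Random(0))
```

Whatever the random choices, the returned `answer` is always correct; only the number of tries varies.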

Karloff [69] gave a Las Vegas RNC algorithm for the maximum matching problem by
utilizing a “min-max” duality relation, the Tutte-Berge formula [104], that characterizes the
size of a maximum matching. Karloff showed that the minimum side of the relation could
be calculated in RNC; if this is equal to the size of the candidate maximum matching, the
Tutte-Berge formula guarantees that they are both optimal.
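
The duality relation can be observed on small graphs by brute force. The following Python sketch uses the usual statement of the Tutte-Berge formula, in which the maximum matching size equals the minimum over vertex subsets U of (n + |U| - odd(G - U))/2, where odd(G - U) counts the components of G - U with an odd number of vertices. The code is exponential and is meant only as an illustration; it is not Karloff's algorithm, and all names are ours.

```python
from itertools import combinations


def max_matching_size(edges):
    """Brute force: the largest number of pairwise disjoint edges."""
    def extend(start, used):
        best = 0
        for idx in range(start, len(edges)):
            u, v = edges[idx]
            if u not in used and v not in used:
                best = max(best, 1 + extend(idx + 1, used | {u, v}))
        return best
    return extend(0, frozenset())


def odd_components(vertices, edges):
    """Number of connected components with an odd number of vertices."""
    remaining = set(vertices)
    adj = {v: set() for v in remaining}
    for u, v in edges:
        if u in remaining and v in remaining:
            adj[u].add(v)
            adj[v].add(u)
    count = 0
    while remaining:
        stack = [remaining.pop()]
        size = 1
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in remaining:
                    remaining.remove(y)
                    stack.append(y)
                    size += 1
        count += size % 2
    return count


def tutte_berge_minimum(vertices, edges):
    """Minimum over all U of (n + |U| - odd(G - U)) / 2."""
    n = len(vertices)
    values = []
    for r in range(n + 1):
        for removed in combinations(vertices, r):
            rest = [v for v in vertices if v not in removed]
            values.append((n + r - odd_components(rest, edges)) // 2)
    return min(values)


star = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
path = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
```

For the star, removing the center leaves three odd components and certifies that no matching has more than one edge; for the path both sides equal 2.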

The results of Karp, Upfal and Wigderson and of Mulmuley, Vazirani and Vazirani both
yielded algorithms for the weighted versions of the problem as well. Since the number of
processors is proportional to $n^{3.5} m\, w_{\max}$, where $w_{\max}$ is the maximum edge weight, these are
only RNC algorithms if the edge weights are input in unary. No RNC algorithm is known
for the maximum weight matching problem or the minimum weight perfect matching problem
when the weights are input in binary; nor are these problems known to be P-Complete. Further,
no Las Vegas RNC algorithms for unary weighted matching problems were known until our
work.

Although no deterministic NC algorithms are known for matching problems, some progress
has been made on deterministic sublinear time parallel algorithms. We say that an algorithm
runs in $O^*(f(n))$ time if it runs in $O(f(n) \log^k n)$ time for some constant k. Goldberg, Plotkin
and Vaidya [46] gave a parallel algorithm to find a maximum matching in a bipartite graph that
ran in $O^*(n^{2/3})$ time using BFS(n, m) processors, and a parallel algorithm to find a
minimum-weight assignment that ran in $O^*(n^{2/3} \log(nC))$ time using SSP(n, m) processors. Here $C = w_{\max}$
is the maximum edge cost, BFS(n, m) denotes the maximum of n + m and the number of
processors required to find a breadth-first search tree in $O(\log^2 n)$ time, and SSP(n, m) denotes
the maximum of n + m and the number of processors required to find a single-source
shortest-path tree (with non-negative weights) in $O(\log^2 n)$ time.

Goldberg, Plotkin, Shmoys and Tardos [48] used interior point methods to give an $O^*(\sqrt{m} \log C)$
time parallel algorithm for the assignment problem and an $O^*(\sqrt{m})$ time algorithm for the
unweighted version. Both of these algorithms used $O(m^3)$ processors.

Besides  the  running  time,  another  way  to  measure  the  performance  of  a  parallel  algorithm 
is  the  total  work  performed,  where  we  define  work  as  the  product  of  processors  x  time.  The 
best  results  for  bipartite  matching  and  the  assignment  problem  in  this  regard  are  those  of 
Gabow  and  Tarjan,  who  for  a  large  range  of  numbers  of  processors  p  give  parallel  algorithms 
for  p  processors  with  total  work  within  a  factor  of  O(logp)  of  the  work  of  the  best  sequential 
algorithm  [43]. 

As a consequence of the RNC matching algorithms, the maximum-flow problem with the
edge capacities input in unary is also in RNC. The maximum-flow problem and the
minimum-cost flow problem are known to be P-Complete, however, when the edge capacities are input
in binary [49]. Therefore it is believed that there exist no NC or RNC algorithms for these
problems. Nonetheless parallel computation can yield speedups in the running time; for example,
the Goldberg-Tarjan maximum-flow algorithm can be implemented so as to give an $O(n^2 \log n)$
time, $O(n)$ processor algorithm [45]. For the minimum-cost flow problem they give a parallel
algorithm that takes $O(n^2 (\log n) \log(nC))$ time on $O(n)$ processors [45].

In the next three chapters we present three results about parallel algorithms for these
problems. In Chapter 5 we give a simple but interesting separation between the parallel complexity
of the maximum-flow problem and the minimum-cost maximum flow problem, showing that
the minimum-cost maximum flow problem cannot be approximated in NC or RNC unless P
is equal to, respectively, NC or RNC. In contrast, it is known that the maximum-flow
problem can be approximated quite closely in RNC, and, by a very recent result of Fischer,
Goldberg and Plotkin, that the maximum matching problem can be approximated arbitrarily
closely in NC [42].

In  Chapters  6  and  7  we  consider  both  theoretical  and  experimental  issues  in  the  parallel 
solution of weighted matching problems. In Chapter 6 we give Las Vegas RNC algorithms to
find a minimum-weight perfect matching when the edge weights are input in unary. We also
show how to apply the technique to a number of other problems as well.

In Chapter 7 we describe an experimental study of various parallel algorithms for the
assignment problem on a real massively parallel computer, the Connection Machine CM-2¹. We
consider  in  detail  one  special  case,  that  of  the  fully  dense  assignment  problem,  where  the  graph 
is  a  complete  graph.  We  implemented  a  number  of  different  algorithms  for  this  problem,  and 
developed  a  new  hybrid  approach  similar  to  the  algorithm  of  Goldberg,  Plotkin  and  Vaidya. 
This  proved  to  be  the  most  successful  by  a  considerable  margin.  We  also  discuss  the  viability 
of  the  other  major  approaches  which  we  did  not  implement. 


¹ This is a trademark of Thinking Machines Corporation.


Chapter  5 


The Parallel Approximability of a
Flow Problem¹


Once a problem is proved to be P-Complete, it is generally believed that there exists no NC
or RNC algorithm to solve it exactly². Therefore, the next important question becomes: how
well can it be approximated in NC or RNC? In this chapter we show that despite the fact that
one can approximate the value of a maximum flow arbitrarily closely in RNC, approximating
the value of the minimum-cost maximum flow, within a factor of the maximum cost in the
network, is P-Complete. Our proof also shows that this is true for networks with maximum
cost polynomial in the size of the network. The chapter consists of two short sections. In
Section 5.1 we discuss previous work on the NC-approximability of P-Complete problems and
flow problems in particular. In Section 5.2 we prove our result.


5.1  Background 


There has been some amount of work on NC-approximation algorithms for P-Complete
problems. For example, Anderson and Mayr considered the High Degree Subgraph Problem: Given
a graph G and an integer k, find the maximum induced subgraph of G that has all nodes of
degree at least k. They prove that this problem is P-Complete, and further that it is P-Complete
to approximate within any factor better than $\frac{1}{2}$ [2]. In other words, it is P-Complete to produce
a subgraph that is of size greater than $\frac{1}{2}$ of the maximum induced subgraph with the
appropriate connectivity constraints. In contrast to this result they give an NC algorithm that can
approximate the solution within a factor arbitrarily close to $\frac{1}{2}$. Subsequently Hochbaum and
Shmoys showed how to approximate it within a factor of exactly $\frac{1}{2}$ [59]. A similar result was
obtained by Kirousis, Serna and Spirakis [76], who investigated the High Edge-Connectivity
Subgraph Problem, and showed it could be approximated in NC within any factor less than $\frac{1}{2}$, but
producing a better approximation was P-Complete. They also demonstrated the same type of
behavior for the vertex-connectivity version of the problem.

¹ This chapter describes joint work with Cliff Stein [117].

² Despite the fact that P-Completeness is usually defined in terms of decision problems, in this chapter we
will often refer to the optimization versions as well. This has no effect on our results.

There have also been several results that show that a certain P-Complete problem cannot
be approximated in NC within any factor unless P = NC [77, 107]. The most interesting of
these is a recent result by Serna, who proved the P-Completeness of approximating Linear
Programming within any factor [106]. This raises the question of whether the same is true
for other P-Complete problems, such as maximum flow or minimum-cost flow, that can be
described as combinatorial linear programs.

It is unlikely that such a result is true for the maximum-flow problem, since it is known
that it can be approximated arbitrarily closely in RNC. This is a consequence of the unary
capacity version of the problem being in RNC. A binary capacity problem instance can be
approximated within a factor of $(1 + \frac{1}{\mathrm{polynomial}(n)})$ by using the unary capacity algorithm
on a scaled version of the instance whose capacities are only the high order $O(\log n)$ bits of the binary
capacities. It is also true that the minimum-cost maximum flow problem is in RNC when both
the capacities and the costs are in unary. Therefore one might imagine that it might be possible
to approximate the minimum-cost maximum flow problem with binary capacities and unary
costs in RNC; we prove that this is not the case unless P = RNC.
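
The scaling step mentioned above can be pictured as follows. This Python sketch shows one plausible reading, which is our assumption since the text does not spell out the details: every capacity is shifted right so that the largest capacity keeps only a fixed number of high-order bits. The function name and the `bits` parameter are hypothetical.

```python
def truncate_capacities(capacities, bits):
    """Keep only the `bits` high-order bits, measured from the largest capacity.

    Returns the scaled capacities and the shift that was applied, so that
    scaled * 2**shift <= original < (scaled + 1) * 2**shift on every edge.
    """
    top = max(capacities.values())
    shift = max(0, top.bit_length() - bits)
    scaled = {e: u >> shift for e, u in capacities.items()}
    return scaled, shift


capacities = {("s", "a"): 1000, ("a", "t"): 37, ("s", "t"): 513}
scaled, shift = truncate_capacities(capacities, bits=4)
```

With `bits` chosen as a multiple of log n, every scaled capacity is bounded by a polynomial in n, which is what allows a unary-capacity algorithm to be applied; each edge loses less than $2^{\text{shift}}$ units of capacity in the process.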


5.2  Proof 

Definition 5.2.1 AMCF(p) is the problem of approximating the value of the minimum-cost
maximum flow in a network to within a factor of p, where the capacities are expressed in binary and
the costs are expressed in unary.

Theorem 5.2.2 AMCF(p) is log-space complete for P.

We will prove this theorem by reducing a form of the monotone circuit value problem
(MCV2) to AMCF(p). The reduction is a simple generalization of the proof of Goldschlager,
Shaw and Staples [49] that the problem of determining the exact value of the maximum flow
in a network is log-space complete for P. Before we begin the proof we give several necessary
definitions and known results.

Definition 5.2.3 A monotone circuit $\alpha$ is a sequence $(\alpha_n, \ldots, \alpha_0)$ where each $\alpha_i$ is either an
input, in which case its value of either 0 or 1 is given explicitly, or a gate. A gate $\alpha_i$ is either an
AND gate AND(j, k) or an OR gate OR(j, k) where $j > k > i$. The fan-out of a gate $\alpha_j$ is the
number of gates $\alpha_k$, $k < j$, to which $\alpha_j$ is an input.

Definition 5.2.4 MCV2 is the problem of determining the value of a monotone circuit such
that each input has fan-out at most one, each gate has fan-out at most 2, and the last gate is an
OR gate.
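
The two definitions can be made concrete with a small evaluator. In this Python sketch a circuit is a dictionary from the index i to either ("INPUT", value) or a gate (kind, j, k) with j > k > i; we take index 0 to be the output and therefore the "last gate". The representation and all names are our assumptions.

```python
def evaluate(circuit):
    """Value of the output (index 0), computed from the highest index down."""
    value = {}
    for i in sorted(circuit, reverse=True):
        gate = circuit[i]
        if gate[0] == "INPUT":
            value[i] = gate[1]
            continue
        kind, j, k = gate
        if not j > k > i:
            raise ValueError("gate inputs must satisfy j > k > i")
        if kind == "AND":
            value[i] = value[j] and value[k]
        else:
            value[i] = value[j] or value[k]
    return value[0]


def fan_out(circuit):
    """Number of gates to which each input or gate feeds."""
    out = {i: 0 for i in circuit}
    for gate in circuit.values():
        if gate[0] != "INPUT":
            out[gate[1]] += 1
            out[gate[2]] += 1
    return out


def is_mcv2_instance(circuit):
    """Inputs have fan-out at most 1, gates at most 2, and the output is an OR gate."""
    out = fan_out(circuit)
    for i, gate in circuit.items():
        limit = 1 if gate[0] == "INPUT" else 2
        if out[i] > limit:
            return False
    return circuit[0][0] == "OR"


circuit = {
    4: ("INPUT", True),
    3: ("INPUT", False),
    2: ("INPUT", True),
    1: ("AND", 4, 3),
    0: ("OR", 2, 1),
}
```

The example circuit computes (true AND false) OR true, so it evaluates to true, and it satisfies the fan-out restrictions of MCV2.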

Theorem 5.2.5 [49] MCV2 is log-space complete for P.

We  are  now  ready  to  begin  the  proof  of  our  theorem. 

Proof of Theorem 5.2.2: Let $A = (\alpha_n, \ldots, \alpha_0)$ be an instance of MCV2, and let $d_i$ be the
fan-out of gate (or input) $\alpha_i$. We will demonstrate a log-space transformation of A into a flow
network $G_A = (V, E)$. The vertices of $G_A$ are s, t, and i, $0 \le i \le n$. The edges of $G_A$, and their
capacities and costs, are as follows.

Type 1 Cost 0 edges:

•  For each input $\alpha_i$ include an edge (s, i) of capacity 0 if $\alpha_i$ is false, or of capacity $2^i$
if $\alpha_i$ is true. Also include an edge (i, s) of capacity $2^i$.

•  For every OR gate $\alpha_i$ = OR(j, k) include an edge (j, i) of capacity $2^j$, (k, i) of
capacity $2^k$, and (i, s) of capacity $2^j + 2^k - d_i 2^i$.

•  For every AND gate $\alpha_i$ = AND(j, k), include an edge (j, i) with capacity $2^j$, an
edge (k, i) with capacity $2^k$, and an edge (i, t) with capacity $2^j + 2^k - d_i 2^i$.

Type 2 An edge (0, t) with capacity 1 and cost p.

Type 3 An edge (s, t) with capacity 1 and cost 1.

Goldschlager, Shaw and Staples showed that in the flow network composed of the edges
of type 1 and 2, the maximum flow value is odd if and only if the circuit A outputs true.
Furthermore the maximum flow value is odd if and only if the edge (0, t) has one unit of flow.
Therefore to determine the output of circuit A we merely have to determine the flow on edge
(0, t) in a maximum flow. We will show how to prove this momentarily. Assuming for the
moment that it is true, we first complete the proof of our theorem.

Notice that given our cost assignment, the cost of a maximum flow is p + 1 if there is one
unit of flow on edge (0, t) and is otherwise 1. The edge (s, t) will have one unit of flow in any
maximum flow and this edge will therefore contribute a cost of 1. Edge (0, t) will contribute p
units to the cost if edge (0, t) has one unit of flow on it.

Therefore if we could approximate the minimum-cost maximum flow problem within a
factor of p we could determine whether the value for this network was 1 or p + 1, and thus determine
the parity of the maximum flow in the underlying network, which gives the output of circuit A.
This reduction is certainly in logspace as long as p is polynomial in the size of the circuit n. If
we wish to allow edge costs to be expressed in binary, p can be exponential in the size of the
network.
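
The last step of this argument amounts to a threshold test. The Python sketch below assumes a one-sided notion of approximation, namely that the estimate lies between the true cost and p times the true cost; this reading is our assumption, and the function name is hypothetical.

```python
def circuit_output_from_estimate(estimate, p):
    """Recover the output of circuit A from an estimate of the minimum cost.

    Assumes OPT <= estimate <= p * OPT, where OPT is either 1 (circuit
    outputs false) or p + 1 (circuit outputs true).
    """
    return estimate > p


p = 5
false_estimates = range(1, p + 1)               # all values allowed when OPT = 1
true_estimates = range(p + 1, p * (p + 1) + 1)  # all values allowed when OPT = p + 1
```

Under this assumption the two ranges of possible estimates are disjoint, so comparing the estimate with p always recovers the circuit's output.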

We will now explain the proof that circuit A outputs true if and only if the value of the
maximum flow is odd in the flow network that contains the edges of type (1) and (2) from $G_A$. Call
that network $G_A'$. We will exhibit a flow f in $G_A'$ and then prove that it is a maximum flow.
We denote the flow on edge (x, y) by f(x, y) and the capacity of edge (x, y) by u(x, y).

In flow f, for $0 \le i \le n$,

•  For $\alpha_i$ an input of circuit A, f(s, i) = u(s, i). If $\alpha_i$ is not an input to any other gate,
f(i, s) = u(i, s); otherwise it is zero.

•  For $0 \le j \le n$, $f(i,j) = 2^i$ if gate $\alpha_i$ outputs true, otherwise f(i, j) = 0.


• If $a_i = \mathrm{AND}(j, k)$ then $f(i, t) = u(i, t)$ if both $a_j$ and $a_k$ output true. Otherwise $f(i, t) =
f(j, i) + f(k, i)$. The intuition here is that the node representing $a_i$, in a maximum flow,
will only have $d_i 2^i$ units of flow to send from $i$ if it receives $2^j$ and $2^k$ as inputs. Since
this is a maximum flow, all flow will be directed onto $(i, t)$ until that arc is saturated.
Only if both gates $a_j$ and $a_k$ output true will there be flow "left over" after $(i, t)$ has been
saturated.

• If $a_i = \mathrm{OR}(j, k)$ then $f(i, s) = f(j, i) + f(k, i) - d_i 2^i$ if either $a_j$ or $a_k$ outputs true. The
intuition here is that in a maximum flow, flow will go anywhere before going back to $s$, so
if any flow is input from $j$ or $k$ it will become the output of $i$ before returning to $s$.

• $f(0, t) = 1$ if $a_0$ computes true; otherwise $f(0, t) = 0$.

The parity of the value of $f$ is odd if and only if the circuit A outputs true. This is because
$f$ assigns an even amount of flow to every edge except perhaps $(0, t)$, which carries 1 unit if and only if A
outputs true. It remains to prove that $f$ is a maximum flow. This can be proved in the standard
way, by showing there is no augmenting path. If there is an augmenting path from $s$ to $t$ then
the first edge must be a back edge, since each forward edge $(s, i)$ out of $s$ has $f(s, i) = u(s, i)$.
It must end with a forward edge since there are no edges out of $t$. Therefore somewhere on
the path there must be a back edge followed by a forward edge. A simple case analysis shows,
however, that this is impossible. ■


Chapter  6 


Las Vegas RNC Algorithms


6.1  Introduction 

In this chapter we present a Las Vegas RNC algorithm for the problem of finding a minimum-weight
perfect matching in a graph when the weights of the edges are input in unary. We
utilize a transformation of minimum-weight perfect matching to the T-join problem, and use the
structure theory of T-joins as developed by Sebo [105] to develop an optimality condition that
can be computed in RNC. Easy consequences of this result are Las Vegas RNC algorithms for
finding a maximum weight matching, a half-integral planar multicommodity flow, a minimum
weight T-join, a maximum 1-packing of T-cuts in a bipartite graph and the T-join structure of
a graph.


6.2  Background 

6.2.1  T-joins 

We are given an undirected graph $G = (V, E)$ with an associated weight function $w : E \to
\mathbb{Z}^+ \cup \{0\}$. Let $T \subseteq V$ with $|T|$ even. A set $F \subseteq E$ is called a T-join if $\deg_F(x) \equiv 0 \pmod 2$
for $x \notin T$, and $\deg_F(x) \equiv 1 \pmod 2$ for $x \in T$. A minimum T-join is a T-join of minimum
cardinality. We define $\tau(G, T)$, $|T|$ even, to be the size of the minimum T-join for $G$. A
minimum weight T-join $F$ is a T-join which minimizes $\sum_{e \in F} w(e)$.

T-joins can be viewed as generalizations of matchings, Chinese postman tours, and paths.
For  example,  an  s  to  t  path  is  a  T-join  with  T  =  {s,t}  and  a  perfect  matching  M  is  a  V- 
join.  Or,  consider  the  Chinese  postman  problem:  find  a  minimum  length  tour  that  traverses 
every  edge  of  a  graph  at  least  once.  In  an  Eulerian  graph  this  is  just  an  Eulerian  tour;  in  a 
general  graph  there  is  a  clear  one-to-one  correspondence  between  minimum  postman  tours  and 
minimum  T-joins,  where  T  is  the  set  of  all  odd  degree  nodes  in  G  [105]. 
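
The definition of a T-join can be checked directly from degree parities. The following is a minimal sketch; the function name `is_t_join` and the small example graphs are illustrative assumptions, not part of the original text.

```python
from collections import Counter


def is_t_join(F, T):
    """Return True if edge set F is a T-join: a vertex has odd degree in F iff it is in T."""
    deg = Counter()
    for u, v in F:
        deg[u] += 1
        deg[v] += 1
    odd = {x for x, d in deg.items() if d % 2 == 1}
    return odd == set(T)


# An s-to-t path is a T-join with T = {s, t}.
path = [("s", "a"), ("a", "t")]
print(is_t_join(path, {"s", "t"}))

# A perfect matching on V = {1, 2, 3, 4} is a V-join.
matching = [(1, 2), (3, 4)]
print(is_t_join(matching, {1, 2, 3, 4}))
```

Vertices not touched by F have degree zero, so the check correctly rejects any F that misses a vertex of T.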

The  following  well-known  facts  will  be  important;  therefore,  we  present  their  proofs  here. 

Fact 6.2.1 [86, 105] The problem of finding a minimum weight perfect matching in an arbitrary
graph whose edge weights are input in unary can be reduced in NC to finding a minimum cardinality
T-join in a bipartite graph.

Fact 6.2.2 [86] The problem of finding a minimum cardinality T-join can be reduced in NC to
finding a minimum weight perfect matching in a graph whose edge weights are represented in unary.

Proof of Fact 6.2.1:

We first reduce finding a minimum weight perfect matching to finding a minimum weight
T-join in an arbitrary graph. Let $T = V$, $\hat{w}(e) = w(e) + n w_{\max}$, and let $\hat{G}$ be $G$ with edge
weights $\hat{w}$. Note that $\hat{w}(e) > 0$ for all $e$. If $G$ has a perfect matching, $\hat{G}$ has a T-join of weight not
exceeding $\frac{n}{2}(w_{\max} + n w_{\max})$. Any other T-join has at least $\frac{n}{2} + 1$ edges and hence weight at
least $(\frac{n}{2} + 1) n w_{\max}$, so if $G$ has a perfect matching, the minimum weight T-join in $\hat{G}$ is a
minimum weight perfect matching in $G$.

Finding a minimum weight T-join in a graph $G$ can be reduced to finding a minimum
cardinality T-join in a graph $G'$, where we replace an edge $e = pq$ of weight $w(e)$ by a path
$p v_1 v_2 \cdots v_{w(e)-1} q$. None of the $v_i$ are included in $T$. It is easy to see that there is a one-to-one
correspondence between minimum cardinality T-joins in $G'$ and minimum weight T-joins in $G$.

Given a non-bipartite graph $G$, we can construct a bipartite graph $G'$ whose T-joins easily
correspond to those of $G$, by inserting a vertex $v_e$ in the middle of each edge $e$ and not including
$v_e$ in $T$. ■

Proof of Fact 6.2.2:
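
The three graph transformations used in this proof can be sketched as follows. This is a sequential illustration only, not the NC implementation; the edge-list representation and the names `reweight`, `expand_to_unweighted` and `make_bipartite` are assumptions introduced here.

```python
def reweight(edges, n):
    """Add n * w_max to every weight, so that joins with fewer edges are always cheaper."""
    w_max = max(w for _, _, w in edges)
    return [(p, q, w + n * w_max) for p, q, w in edges]


def expand_to_unweighted(edges):
    """Replace each edge (p, q, w) with w >= 1 by a path of w unit edges.

    The w - 1 internal vertices are fresh tuples ("sub", edge_index, k); none is in T.
    """
    out = []
    for idx, (p, q, w) in enumerate(edges):
        chain = [p] + [("sub", idx, k) for k in range(1, w)] + [q]
        out.extend(zip(chain, chain[1:]))
    return out


def make_bipartite(unit_edges):
    """Insert a fresh vertex ("mid", edge_index) in the middle of every edge."""
    out = []
    for idx, (p, q) in enumerate(unit_edges):
        mid = ("mid", idx)
        out.extend([(p, mid), (mid, q)])
    return out


weighted = [("a", "b", 1), ("b", "c", 2)]
heavy = reweight(weighted, n=3)
unit = expand_to_unweighted(heavy)
print(heavy, len(unit), len(make_bipartite(unit)))
```

Because the weights are given in unary, the expanded graph is only polynomially larger than the input.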

Compute the shortest paths between each pair of vertices in $T$; it is well known this can
be done in NC [73]. Now consider the complete graph with vertex set $T$ where the weight of
edge $ij$ is the length of the shortest path from $i$ to $j$ in $G$. If we find a minimum weight perfect
matching $M$ in this graph, the set of edges of $G$ that correspond to the paths represented by
the edges of $M$ form a minimum cardinality T-join. Note that if $G$ has $n$ vertices the maximum
edge weight arising from this shortest-paths computation is $n$ and therefore can be represented
in unary while increasing the size of the problem by at most a polynomial in its original size;
therefore the reduction can be carried out in NC. ■
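
The reduction in this proof can be illustrated sequentially on small graphs. In the sketch below, breadth-first search stands in for the NC shortest-paths computation and a brute-force pairing stands in for the matching algorithm; both substitutions, the adjacency-list input, and all function names are assumptions made for illustration. The graph is assumed to be connected.

```python
from collections import deque


def bfs_parents(adj, src):
    """Parent map of a shortest-path tree rooted at src."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def path_edges(parent, dst):
    """Edges (as frozensets) on the tree path from the BFS root to dst."""
    edges = []
    while parent[dst] is not None:
        edges.append(frozenset((dst, parent[dst])))
        dst = parent[dst]
    return edges


def best_pairing(nodes, dist):
    """Brute-force minimum weight perfect matching on a small, even-size node list."""
    if not nodes:
        return 0, []
    first, rest = nodes[0], nodes[1:]
    best = None
    for i, other in enumerate(rest):
        cost, pairs = best_pairing(rest[:i] + rest[i + 1:], dist)
        cost += dist[first][other]
        if best is None or cost < best[0]:
            best = (cost, [(first, other)] + pairs)
    return best


def min_cardinality_t_join(adj, T):
    T = sorted(T)
    parents = {t: bfs_parents(adj, t) for t in T}
    dist = {t: {u: len(path_edges(parents[t], u)) for u in T} for t in T}
    _, pairs = best_pairing(T, dist)
    join = set()
    for a, b in pairs:
        # Symmetric difference keeps the degree parities right even if paths overlap.
        join ^= set(path_edges(parents[a], b))
    return join


cycle = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
print(min_cardinality_t_join(cycle, {1, 2, 3, 4}))
```

The brute-force pairing takes exponential time and is only meant to make the correspondence between matchings on T and T-joins in G concrete.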

6.2.2  T-cuts 

In  this  section  we  describe  a  dual  to  the  T-join,  and  min-max  relations  that  will  allow  us  to 
certify  when  both  are  optimal. 

For $X \subseteq V$ let $\delta(X) = \{xy \in E : x \in X, y \notin X\}$. The subset $K \subseteq E$ is a cut if $K = \delta(X)$
for some $X \subseteq V$. If $|T \cap X| \equiv 1 \pmod 2$, then $\delta(X)$ is called a T-cut.

A k-packing of T-cuts, where $k$ is a positive integer, is a multiset $\mathcal{P}$ of T-cuts such that each
edge in the graph appears in no more than $k$ of the cuts. We define
$\nu_k = \nu_k(G, T) = \max\{|\mathcal{P}| : \mathcal{P} \text{ is a } k\text{-packing of T-cuts}\}$.

It is not hard to see that $\tau \ge \frac{1}{2}\nu_2 \ge \nu_1$. The following minimax theorems are known.

Theorem 6.2.3 [85] Let $G$ be a graph; then $\tau(G, T) = \frac{1}{2}\nu_2(G, T)$.

Theorem 6.2.4 [111] Let $G$ be a bipartite graph; then $\tau(G, T) = \nu_1(G, T)$.
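
As a concrete illustration of these definitions, the sketch below computes cuts, tests the T-cut parity condition, and checks the k-packing condition. The helper names and the example path graph are assumptions introduced here.

```python
from collections import Counter


def delta(edges, X):
    """The cut delta(X): edges with exactly one endpoint in X."""
    X = set(X)
    return frozenset(frozenset(e) for e in edges if (e[0] in X) != (e[1] in X))


def is_t_cut(X, T):
    """delta(X) is a T-cut when X contains an odd number of vertices of T."""
    return len(set(X) & set(T)) % 2 == 1


def is_k_packing(cuts, k):
    """Every edge may appear in at most k of the cuts (cuts may repeat)."""
    usage = Counter(e for cut in cuts for e in cut)
    return all(count <= k for count in usage.values())


# Path a-b-c-d with T = {a, d}: the minimum T-join is the whole path (3 edges).
E = [("a", "b"), ("b", "c"), ("c", "d")]
T = {"a", "d"}
sides = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
cuts = [delta(E, X) for X in sides]
print(all(is_t_cut(X, T) for X in sides), is_k_packing(cuts, 1), len(cuts))
```

On this bipartite example the three disjoint T-cuts match the size of the minimum T-join, as Theorem 6.2.4 requires.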

6.2.3  Sketch  of  Algorithm 

Theorem 6.2.4 implies that if we are given a T-join $F$ and a 1-packing of T-cuts $Q$ in a bipartite
graph such that $|F| = |Q|$, then $F$ is a minimum T-join and $Q$ is a maximum 1-packing of T-cuts.
It is this property we use in our algorithm, which we now sketch.

1. Given $G$, use the reduction of Fact 6.2.1 to reduce the problem of finding a minimum
weight perfect matching in $G$ to that of finding a minimum cardinality T-join in a bipartite graph
$\hat{G}$. Then use Fact 6.2.2 and the randomized matching algorithm of Mulmuley, Vazirani
and Vazirani to find a minimum unweighted T-join $F$ in $\hat{G}$. With high probability $F$ is a
minimum cardinality T-join in $\hat{G}$, but with low probability it is not. (We actually have to
test that it is a T-join at all, since if the algorithm fails the result need not be a T-join.
We ignore this detail for the rest of the chapter.) Call $F$ the candidate minimum T-join.

2. Calculate $\nu_1(\hat{G}, T)$ in RNC. How this is done will be discussed below, but the important
point is that if $\nu_1$ is the returned value, then with high probability it is indeed $\nu_1(\hat{G}, T)$ and
with low probability it is not. Call the packing that is found the candidate maximum
packing.

3. If $|F|$ is the size of our candidate T-join, $\nu_1$ is the size of our candidate 1-packing
of T-cuts, and $|F| = \nu_1$, then each was calculated correctly and is optimal. Reverse the
reductions to produce a certified minimum-weight perfect matching.
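
The three steps above follow the usual Las Vegas pattern: repeat two Monte Carlo computations until their answers certify each other. The sketch below shows only that control flow. The two randomized subroutines are toy stand-ins (hypothetical, with a 10% failure rate chosen arbitrarily), and it is assumed that a failed run still returns a valid, merely suboptimal, join or packing.

```python
import random


def las_vegas(candidate_join, candidate_packing, max_rounds=1000):
    """Repeat until the candidate T-join and candidate 1-packing have equal size."""
    for _ in range(max_rounds):
        F = candidate_join()       # a valid T-join, possibly too large
        Q = candidate_packing()    # a valid 1-packing of T-cuts, possibly too small
        if len(F) == len(Q):       # |F| >= tau >= nu_1 >= |Q|, so equality certifies both
            return F, Q
    raise RuntimeError("no certified answer within max_rounds")


rng = random.Random(0)


def noisy_join():
    # Toy stand-in: the optimum has 3 edges; a failed run returns 5 edges.
    return ["e1", "e2", "e3"] if rng.random() < 0.9 else ["e1", "e2", "e3", "e4", "e5"]


def noisy_packing():
    # Toy stand-in: the optimum has 3 cuts; a failed run returns 2 cuts.
    return ["c1", "c2", "c3"] if rng.random() < 0.9 else ["c1", "c2"]


F, Q = las_vegas(noisy_join, noisy_packing)
print(len(F), len(Q))
```

The answer is never wrong; only the number of rounds is random, and it is small in expectation when each subroutine succeeds with constant probability.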

We need to explain how to do Step 2 in RNC. In order to calculate $\nu_1(G, T)$, we need to
understand the structure theory of Sebo.


6.3  A  Structure  Theorem  of  Sebo 


In this section we will present a theorem of Sebo that will enable us to construct a maximum
1-packing of T-cuts in a bipartite graph $G = (V, E)$ by calculating $|V|$ different $T'$-joins. Given
a graph $G$ and a set $T \subseteq V$, a necessary condition for $G$ to possess a T-join is that $|T|$ be even.
For all $x \in V$, define the set $S_x \subseteq V$ as follows:

$$S_x = \begin{cases} S \cup \{x\} & \text{if } x \notin S \\ S - \{x\} & \text{if } x \in S. \end{cases}$$

Note that when $S \subseteq V$, $|S|$ odd, the $S_x$ are all of even cardinality. A set of minimum
cardinality $S_x$-joins for all $x \in V$ is called the S-join structure of the graph. It is such a set
of $S_x$-joins we will use to construct a maximum 1-packing of T-cuts.


Theorem 6.3.1 [105] Let $G = (V, E)$ be a bipartite graph, $S \subseteq V$, $|S|$ odd. For each node $x \in V$
let $F_x$ be an $S_x$-join. For all $x \in V$ define $\pi(x) = |F_x|$. Let $G^i$ be the graph induced by the set of
vertices $x$ such that $\pi(x) \le i$. Let $\mathcal{D}(\pi) = \{D : D \text{ is the vertex set of a connected component of } G^i \text{ for some } i\}$.

Then $F_x$ is a minimum cardinality $S_x$-join for all $x \in V$ if and only if

1. $|\pi(y) - \pi(x)| = 1$ for all $xy \in E(G)$.

2. For all $x \in V$ and $D \in \mathcal{D}(\pi)$,

$$|F_x \cap \delta(D)| = \begin{cases} 0 & \text{if } x \in D \\ 1 & \text{if } x \notin D. \end{cases}$$
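
The two conditions of Theorem 6.3.1 can be checked mechanically. The following sequential sketch assumes the level graphs are induced by $\pi(x) \le i$; the function names, the edge-list representation, and the two small bipartite examples are assumptions made for illustration.

```python
def components(vertices, edges):
    """Connected components of the subgraph induced by `vertices`."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in vertices and v in vertices:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for s in vertices:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps


def level_components(V, edges, pi):
    """D(pi): components of the graphs induced by {x : pi[x] <= i}, over all levels i."""
    family = set()
    for i in sorted(set(pi.values())):
        family.update(components([x for x in V if pi[x] <= i], edges))
    return family


def satisfies_conditions(V, edges, joins):
    """Check conditions 1 and 2 for joins[x] = F_x, each given as a list of edges."""
    pi = {x: len(joins[x]) for x in V}
    if any(abs(pi[u] - pi[v]) != 1 for u, v in edges):
        return False                                   # condition 1 fails
    for D in level_components(V, edges, pi):
        for x in V:
            crossing = sum(1 for u, v in joins[x] if (u in D) != (v in D))
            if crossing != (0 if x in D else 1):
                return False                           # condition 2 fails
    return True


# Path a-b-c with S = {a}: S_a is empty, S_b = {a, b}, S_c = {a, c}.
V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c")]
joins = {"a": [], "b": [("a", "b")], "c": [("a", "b"), ("b", "c")]}
print(satisfies_conditions(V, E, joins))
```

If any one of the candidate joins is not minimum, some condition fails, which is what makes the theorem usable as a certificate.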


We will in a moment explain part of Sebo's proof of this theorem in order to explain how
to construct the packing of T-cuts. We note first, however, that merely the statement of this
theorem is enough information to yield our main result, a Las Vegas RNC algorithm for the
minimum (unary) weight perfect matching problem.¹ We have already reduced this problem
to finding a minimum-cardinality T-join in a bipartite graph. A Las Vegas RNC algorithm for
that problem is as follows.


• If $T \neq \emptyset$ choose $v \in T$ arbitrarily and let $S = T_v$ (that is, $T - \{v\}$). If $T = \emptyset$ choose $v \in V$ arbitrarily and
let $S = \{v\}$. In either case, note that $S_v = T$.

• Find $|V|$ (candidate) minimum $S_v$-joins, one for each $v \in V$, using Fact 6.2.2 and the
RNC weighted matching algorithm.

• Apply Theorem 6.3.1 to verify that all the $S_v$-joins are simultaneously optimal.


If the probability of failure in the calculation of each $S_v$-join is suitably small then this will
be a Las Vegas RNC algorithm for a minimum cardinality T-join in a bipartite graph. This
algorithm requires no explicit knowledge of T-cuts and is therefore conceptually simpler, but
it is no more efficient than our main algorithm, and it also does not yield the algorithmic
results on packings of T-cuts in Corollary 6.4.2 nor the results of Section 6.5. Therefore we
next give the proof that conditions 1 and 2 imply that each $F_x$ is a minimum $S_x$-join in order
to describe how a 1-packing of $S_x$-cuts is constructed.

Condition 1 implies that no edge connects two vertices that have the same $\pi$ value. Therefore,
every edge in $E$ is in some cut $\delta(D)$, $D \in \mathcal{D}(\pi)$. Furthermore, since $|\pi(y) - \pi(x)| = 1$,
$xy$ is in only one cut $\delta(D)$; therefore, $\{\delta(D) : D \in \mathcal{D}(\pi)\}$ is a partition of $E$. This enables
us to express $F_x$ as the union of its intersections with the cuts $\delta(D)$, $D \in \mathcal{D}(\pi)$, and therefore

$$|F_x| = \sum_{D \in \mathcal{D}(\pi)} |F_x \cap \delta(D)| = |\{D \in \mathcal{D}(\pi) : x \notin D\}|.$$

¹ We would like to thank the anonymous referee who made this observation.

We claim that if $x \notin D$ then $D$ contains an odd number of nodes in $S_x$, and that therefore
$\delta(D)$ is an $S_x$-cut. This is true since an $S_x$-join is the union of cycles and simple paths between
nodes in $S_x$. Therefore, if $|\delta(D) \cap F_x| = 1$, there must be exactly one of these simple paths
that crosses the cut $\delta(D)$, and no cycles. Let $w \in S_x$ be the endpoint of that path in $D$. Since
the degree of any node $v \in S_x$ must be odd in an $S_x$-join, there must be an even number of
nodes in $D$ besides $w$ that are in $S_x$, since they must be paired up by simple paths that do
not leave $D$. Thus there are an odd number of nodes of $S_x$ in $D$, and $\delta(D)$ is an $S_x$-cut. Thus
the set $\{\delta(D) : D \in \mathcal{D}(\pi), x \notin D\}$ is a 1-packing of $S_x$-cuts, which we showed earlier has cardinality
$|F_x|$. By Theorem 6.2.4 each of the $F_x$ is a minimum $S_x$-join and $\{\delta(D) : D \in \mathcal{D}(\pi), x \notin D\}$ is a
maximum 1-packing of $S_x$-cuts. This completes the proof that conditions 1 and 2 imply that
each $F_x$ is a minimum $S_x$-join; furthermore, it shows how, from a complete set of $F_x$, $x \in V$,
we can calculate $\nu_1(G, S_x)$ for any $x$.

6.4  An  Algorithm  for  Certifying  a  Minimum  Weight  Perfect 
Matching 

The one step of the algorithm from Section 6.2.3 still to be explained is the calculation, with
high probability, of $\nu_1(\hat{G}, T)$. Let $S = T - \{x\}$ for some arbitrary $x \in T$, and let $S$ be the odd
cardinality set referred to in Theorem 6.3.1. Let $\hat{G}$ have $\hat{n}$ nodes and $\hat{m}$ edges. The parallel
algorithm to calculate $\nu_1(\hat{G}, T)$ is as follows:

1. Calculate, with high probability, $\pi(v)$ for all vertices $v$ in $\hat{G}$.

2. Calculate the connected components of $\hat{G}^i$ for each $i$.

3. Calculate $|\{D \in \mathcal{D}(\pi) : x \notin D\}|$.

If the calculation in Step 1 succeeded, by the proof of Theorem 6.3.1 we know that $|\{D \in
\mathcal{D}(\pi) : x \notin D\}|$ is the size of the minimum cardinality $S_x$-join (T-join) in $\hat{G}$. If this is equal to
the size of the candidate minimum T-join, we know the candidate is optimal, as is our original
minimum weight perfect matching.

Each $\pi(v)$ can be calculated by finding a minimum cardinality $S_v$-join, which by Fact
6.2.2 can be done by a shortest paths computation and by finding a minimum weight perfect
matching. Therefore Step 1 is in RNC. Steps 2 and 3 are well known to be in NC [73], so the
calculation of $\nu_1(\hat{G}, T)$ can be carried out in RNC.
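
Steps 2 and 3 can be sketched sequentially as follows, taking the values $\pi(v)$ from Step 1 as given. The union-find component computation is a stand-in for the NC connected-components algorithm, and the function name, input format and example are assumptions introduced for illustration.

```python
def packing_from_pi(V, edges, pi, x):
    """Return the cuts delta(D) for the level components D of pi that do not contain x."""
    seen, cuts = set(), []
    for i in sorted(set(pi.values())):
        low = {v for v in V if pi[v] <= i}
        root = {v: v for v in low}          # union-find over the induced subgraph

        def find(v):
            while root[v] != v:
                root[v] = root[root[v]]     # path halving
                v = root[v]
            return v

        for u, v in edges:
            if u in low and v in low:
                root[find(u)] = find(v)
        comps = {}
        for v in low:
            comps.setdefault(find(v), set()).add(v)
        for D in map(frozenset, comps.values()):
            if x not in D and D not in seen:
                seen.add(D)
                cuts.append({(u, v) for u, v in edges if (u in D) != (v in D)})
    return cuts


# Path a-b-c-d with T = {a, d}; take x = a, so S = {d} and S_a = T.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("b", "c"), ("c", "d")]
pi = {"a": 3, "b": 2, "c": 1, "d": 0}       # sizes of minimum S_v-joins
cuts = packing_from_pi(V, E, pi, "a")
print(len(cuts), cuts)
```

Here the three cuts are edge-disjoint and their number equals the size of the minimum T-join, the whole path.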

Now consider the time and processor requirements in a CRCW PRAM model of computation
[73]. In the reduction of finding a minimum weight perfect matching in $G$ to finding a minimum
cardinality T-join in $\hat{G}$ we increase the size of the graph considerably. If $G$ had $n$ nodes, $m$
edges and maximum edge weight $w_{\max}$, we increase the edge weights to $O(nw_{\max})$ and then
expand each edge of weight $w$ into an unweighted path of length $w$. Thus the number of nodes
in $\hat{G}$, $\hat{n}$, is $O(nmw_{\max})$ and the number of edges, $\hat{m}$, is $O(nmw_{\max})$ as well.

In Step 1 we must find $\hat{n}$ $S_v$-joins in $\hat{G}$. In order to find each of these $O(nmw_{\max})$
$S_v$-joins we compute a minimum weight perfect matching in a graph with $n$ nodes, $O(n^2)$ edges,
and maximum edge weights of $O(nm \cdot w_{\max})$. This is because $|T| = n$ both in $G$ and in
$\hat{G}$, and the length of a path in $\hat{G}$ is bounded by $\hat{m}$. The algorithm of Mulmuley, Vazirani
and Vazirani requires $O(N^{3.5}MW)$ processors and $O(\log^2 N)$ time to find a minimum weight
perfect matching in a graph with $N$ nodes, $M$ edges and maximum edge weight $W$. A factor
of $NM$ in the processor bound comes from scaling up the edge weights by $NM$ in order to
ensure that the probability of success is at least 1/2. In our reduction to the T-join problem
we have already scaled up the edge weights by a factor of $n$. Since we are doing $O(nmw_{\max})$
calculations, however, we require a smaller probability of failure for each calculation, specifically
at most $1/O(nmw_{\max})$, so that the overall probability of success will be at least 1/2. Therefore we
must scale up the weights by an additional factor of $nmw_{\max}$, and the entire algorithm will
require $O(\log^2 n)$ time and $O(n^{2.5} n^2 (nmw_{\max})^3) = O(n^{7.5} m^3 w_{\max}^3)$ processors.
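
The requirement on the per-calculation failure probability follows from a union bound. As a worked step, write $k$ for the number of $S_v$-join calculations and $p_i$ for the failure probability of the $i$-th one (both symbols are introduced here for illustration, and the constant 1/2 is one convenient choice):

```latex
\Pr[\text{some calculation fails}]
  \;\le\; \sum_{i=1}^{k} p_i
  \;\le\; k \cdot \frac{1}{2k}
  \;=\; \frac{1}{2},
\qquad k = O(nmw_{\max}).
```

So it suffices for each calculation to fail with probability at most $1/(2k)$, which is what the additional scaling of the weights is meant to achieve.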

This processor bound can be reduced at the expense of a logarithmic factor in running time
if we note that the values of $\pi(v)$ for a subset of the vertices of $\hat{G}$ determine $\pi(v)$ for all the
vertices of $\hat{G}$. We will show that $O(\log\max(n, w_{\max}))$ rounds of finding $O(m)$ $S_v$-joins will
suffice to determine all the $\pi(v)$.

To prove this observe that if, in the construction of $\hat{G}$ from $G$, an edge $pq$ was expanded
into a path $p v_1 v_2 \ldots v_k q$, then a minimum $S_{v_i}$-join will either include the entire path from $p$
to $v_i$ or from $v_i$ to $q$. Further, there is a unique vertex $v_l$ on the path such that for $j < l$
the edges of the path from $p$ to $v_j$ are in the minimum $S_{v_j}$-join (and thus the path from $v_j$
to $q$ is not), and for $j > l$ the path from $v_j$ to $q$ is in the minimum $S_{v_j}$-join (and the path
from $p$ to $v_j$ is not). This vertex $v_l$ can be found by binary search on the vertices of the
path from $p$ to $q$. Further, we can carry out the binary search for each expanded edge of $G$
in parallel. Each such path is of length $O(nmw_{\max})$; therefore, executing the binary search
requires $O(\log\max(n, w_{\max}))$ rounds of finding $O(m)$ $S_v$-joins. We have replaced one round of
finding $O(nmw_{\max})$ $S_v$-joins with $O(\log\max(n, w_{\max}))$ rounds of finding $O(m)$. This algorithm
will be a factor of $O(\log\max(n, w_{\max}))$ slower and require a factor of $\Omega(nw_{\max})$ fewer processors.
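
The binary search along one expanded path can be sketched as follows. The oracle `uses_p_side(j)` is a hypothetical stand-in for one $S_{v_j}$-join computation: it is assumed to report whether the minimum $S_{v_j}$-join contains the path from $p$ to $v_j$, and to be monotone along the path (true up to the breakpoint, false afterwards).

```python
def find_breakpoint(k, uses_p_side):
    """Return the smallest index l in 1..k+1 with uses_p_side(j) false for all j >= l."""
    lo, hi = 1, k + 1              # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2       # mid is always in 1..k, so the oracle is well defined
        if uses_p_side(mid):
            lo = mid + 1           # breakpoint is strictly to the right of mid
        else:
            hi = mid               # breakpoint is at mid or to its left
    return lo


calls = []


def oracle(j, true_breakpoint=637):
    calls.append(j)                # each call models one round of join computations
    return j < true_breakpoint


print(find_breakpoint(1000, oracle), len(calls))
```

Each iteration corresponds to one round in the text; because all expanded edges can be searched simultaneously, the number of rounds is logarithmic in the path length rather than in the number of paths.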

Thus  we  have  proved  the  following  theorem. 

Theorem 6.4.1 There exists a Las Vegas RNC algorithm for unary-weighted perfect matching
that requires $O(\log^2 n)$ time and $O(n^{7.5}(mw_{\max})^3)$ processors on a CRCW PRAM, and a Las Vegas
RNC algorithm that requires $O(\log^2 n \log\max(n, w_{\max}))$ time and $O(n^{6.5} m^3 w_{\max}^2)$ processors.

Corollary 6.4.2 There exist Las Vegas RNC algorithms for the following problems with the
indicated time and processor requirements on a CRCW PRAM.

1. $O(\log^2 n)$ time and $O(n^{7.5} m)$ processors:

• Finding a minimum T-join in an arbitrary graph.

• Finding a maximum cardinality 1-packing of T-cuts in a bipartite graph.

• Finding the T-join structure of an arbitrary graph.

2. $O(\log^2 n)$ time and $O(n^{5.5}(mw_{\max})^3)$ processors, or $O(\log^2 n \log\max(n, w_{\max}))$ time and $O(n^{5.5} m^2 w_{\max}^2)$
processors:

• Finding a minimum weight T-join when the edge weights are given in unary.

Proof:  Immediate  using  the  arguments  given  above.  ■ 

Note that certifying that a perfect matching in a bipartite graph is of minimum weight is
much simpler, even when the edge weights are given in binary. It is well known that optimal dual
variables for the minimum weight perfect matching problem can be found via a shortest paths
computation, which can be carried out in NC [47, 118]. Therefore, given a candidate minimum
weight perfect matching, we can attempt to find optimal dual variables via this shortest paths
computation. If we fail, we know that the candidate perfect matching is not of minimum weight.
This approach does not seem fruitful in the general non-bipartite case; finding dual variables
does not appear to be any easier than finding the minimum weight perfect matching itself [27].
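
For the bipartite case, the certification by shortest paths can be sketched sequentially with Bellman-Ford on the residual graph of the candidate matching. This is one standard way to obtain dual variables and is offered as an assumption-laden illustration, not as the NC computation cited above; the function name and the matrix and list input formats are hypothetical.

```python
def certify_assignment(cost, match):
    """Try to certify that `match` (match[i] = column assigned to row i) has minimum weight.

    Returns dual variables (u, v) with u[i] + v[j] <= cost[i][j] for all i, j and
    equality on matched pairs, or None if the candidate is not optimal.
    """
    n = len(cost)
    # Residual arcs: row i -> column j with cost c_ij; matched column -> its row with -c_ij.
    arcs = [(i, n + j, cost[i][j]) for i in range(n) for j in range(n)]
    arcs += [(n + match[i], i, -cost[i][match[i]]) for i in range(n)]
    dist = [0] * (2 * n)           # Bellman-Ford from a virtual source joined to every node
    for _ in range(2 * n):
        changed = False
        for a, b, w in arcs:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                changed = True
        if not changed:
            u = [-dist[i] for i in range(n)]
            v = [dist[n + j] for j in range(n)]
            return u, v
    return None                    # distances never settle: a negative cycle exists


cost = [[1, 2], [2, 1]]
print(certify_assignment(cost, [0, 1]))   # optimal candidate: duals are returned
print(certify_assignment(cost, [1, 0]))   # suboptimal candidate: no duals exist
```

A negative cycle in the residual graph corresponds to an alternating cycle that would lower the weight of the matching, so failure to find dual variables is itself a proof that the candidate is not optimal.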

6.5  Planar  Multicommodity  Flow 

We define the multicommodity flow problem as follows. Let $G = (V, E)$ be an undirected graph,
with a set of demand edges $F \subseteq E$ with demand $q(f)$ associated with each edge $f \in F$. Let
each non-demand edge have an associated capacity $w(e)$, and specify that $w$ and $q$ are integral.
When does there exist for each $f \in F$ a flow function $\phi_f$ in $(V, E - F)$ between the two endpoints
of $f$ and of value $q(f)$ such that for all $e \in E - F$

$$\sum_{f \in F} |\phi_f(e)| \le w(e)?$$

Such a set of flow functions is called a multicommodity flow.

Construct the graph $G^*$ by replacing each $f \in F$ with $2q(f)$ parallel edges and each $e \in E - F$ with
$2w(e)$ parallel edges. Denote the image of $F$ under this transformation as $F^*$. Seymour proved
[111] that if a multicommodity flow exists in $G$, it can be constructed from a maximum packing
of T-cuts in the (bipartite) planar dual of $G^*$, where $T$ is the set of nodes in the dual of $G^*$
that are adjacent to an odd number of edges of $F^*$. An easy consequence of this construction
is that the resulting flow values are half-integral.

Since we have demonstrated a Las Vegas RNC algorithm for producing a maximum packing
of T-cuts in a bipartite graph, this yields immediately a Las Vegas RNC algorithm for planar
multicommodity flow when the demands and capacities are given in unary.


Chapter 7


Parallel Algorithms for the
Assignment Problem¹


7.1  Introduction 

In this chapter we move to a more practical perspective on parallel algorithms for network
problems. We present a computational comparison of five different implementations of parallel
algorithms to solve the assignment problem, the problem of finding a minimum-weight perfect
matching in a bipartite graph. We focus on solving the problem in a dense graph, where there
is an edge between every two nodes. All of the algorithms are implemented on the Connection
Machine CM-2, a massively parallel SIMD computer manufactured by Thinking Machines
Corporation. We have also attempted to evaluate other algorithmic approaches that we did not
implement.

We have implemented three versions of Bertsekas' auction algorithm [9, 10, 13]: the standard
Jacobi and Gauss-Seidel versions and a hybrid combination of the two that seems not to have
been considered before. The hybrid implementation is particularly interesting since it uses two different
levels of the potential parallelism of the Connection Machine. We have also implemented two
versions of the method of multipliers [57, 98, 32, 31]. We compared these implementations on
dense assignment problems with costs generated uniformly and randomly. Of the five algorithms,
the hybrid algorithm proved definitively to be the best in these tests, and was, on 1000 x 1000
problems, an average factor of 5 to 10 faster in Connection Machine time than the Jacobi code,
the previous best algorithm implemented on the machine. These comparisons were on a 16,384
processor Connection Machine. The factor of improvement depended on the cost range and
increased with problem size. The computational results on a 32,768 processor machine were
also competitive with the best results achieved by other researchers on MIMD machines on a
similar distribution of problem instances.

¹ This chapter describes joint work with Stavros Zenios [123, 122].

In  the  computational  studies  we  have  seen  of  algorithms  for  dense  assignment  problems 
the  problem  instances  have  been  generated  with  costs  chosen  uniformly  and  randomly.  We 
discuss  the  implications  of  such  test  data  and  compare  our  implementations  on  other  sorts  of 
distributions  as  well. 

Motivation for this Study: The principal, and initial, motivation for this study is as follows.
Goldberg implemented the Goldberg-Tarjan maximum flow algorithm on the Connection Machine
CM-1, and tested it thoroughly. This implementation yielded excellent results and excellent
parallel speedups [45]. However, Goldberg points out that

...  many  maximum-flow  problems  that  appear  in  practice  are  much  smaller 
and  much  simpler  than  the  examples  we  have  generated,  and  most  network-flow 
algorithms  produce  satisfactory  results....  we  have  seen,  however,  that  minimum- 
cost  flow  problems  can  be  solved  by  iterating  a  generalized  version  of  the  maximum- 
flow  algorithm.  Since  large  and  hard  instances  of  the  minimum-cost  flow  problem 
do  appear  in  practice,  fast  maximum-flow  algorithms  are  very  important  in  this 
context  [45]. 

We initially implemented and studied the algorithm to which Goldberg refers on the Connection
Machine CM-2. The algorithm is due to Goldberg and Tarjan and computes a minimum-cost
flow by successive approximations, where each approximation is computed by an algorithm
that is quite similar to the maximum-flow code. However, despite the similarity between
the improve-approximation routine of the minimum-cost flow algorithm and the Goldberg-Tarjan
maximum-flow algorithm, the Connection Machine version of this minimum-cost flow
algorithm performed quite poorly on networks generated by the standard random-network
generator of Kempka and Kennington [121]. The basic problem is that, at least on these instances,
the algorithm does not achieve good parallelism and leaves most of the machine idle most of the
time. This phenomenon arose in several combinatorial applications on the machine, including
shortest paths algorithms [55] and traveling salesman problem heuristics [96]. It also arose in an
initial implementation of Bertsekas' Jacobi auction algorithm for the dense assignment problem.
The auction algorithm and the Goldberg-Tarjan minimum-cost flow algorithm discussed above
are quite similar, especially since the assignment problem is just a special case of minimum-cost
flow.

We  therefore  decided  to  study  alternative  approaches  to  an  auction-like  strategy  in  order 
to  try  to  understand  how  processor  utilization  might  be  increased  in  combinatorial  algorithms 
on  the  Connection  Machine.  We  focused  on  the  assignment  problem  since  it  was  a  special 
case  of  the  minimum-cost  flow  problem  that  nonetheless  captured  many  of  the  same  issues 
with  regard  to  parallel  algorithms.  We  focused  on  the  dense  problem  since  it  yielded  a  very 
communication-efficient  representation  on  the  Connection  Machine,  whereas  communication  for 
sparser  problems  is  comparatively  slow.  Furthermore,  almost  all  of  the  other  studies  on  parallel 
implementations  of  algorithms  for  the  assignment  problem  concentrate  on  the  dense  problem, 
and  thus  we  can  use  the  results  of  this  paper  as  benchmarks  against  the  work  reported  by 
others.  Finally,  we  believe  that  many  of  the  techniques  that  we  developed  will  prove  useful  in 
developing  better  implementations  of  algorithms  for  the  sparse  assignment  problem  and  both 
sparse  and  dense  minimum-cost  flow  problems. 

The  implementations  of  the  methods  of  multipliers  were  motivated  by  the  success  of  Eckstein 
in  using  these  methods  on  sparse  assignment  problems  [32,  31].  These  methods  also  have 
tremendous  potential  to  take  advantage  of  the  massive  parallelism  of  the  Connection  Machine; 
however,  they  proved  to  be  ineffective  on  dense  problems.  The  discussion  of  these  methods  and 
the  evaluation  of  other  possible  algorithms  are  included  in  an  attempt  to  give  a  clearer  picture 
of  the  possibilities  of  developing  an  algorithm  superior  to  our  hybrid  algorithm. 

The  rest  of  this  chapter  is  organized  as  follows.  In  Section  7.2  we  describe  other  work  on 
implementations  of  parallel  algorithms  for  the  assignment  problem.  In  Section  7.3  we  describe 
the  relevant  details  about  the  algorithms  we  implemented,  and  in  Section  7.4  we  describe  the 
implementations on the CM-2. In Section 7.5 we describe our computational results on the
problems with costs chosen randomly from a uniform distribution on integers, and in Section
7.6  we  describe  the  results  on  other  distributions.  In  Section  7.7  we  discuss  the  possibilities  of 
other  approaches  to  the  problem  and  we  give  our  conclusions  in  Section  7.8. 

7.2  Previous  Studies 

A  number  of  researchers  have  studied  parallel  solutions  to  large  dense  assignment  problems  on 
smaller  scale  MIMD  parallel  machines.  These  studies  have  concentrated  entirely  on  the  auction 
algorithm  of  Bertsekas  and  the  Shortest  Augmenting  Paths  Approach  [67].  They  have  also 
concentrated  entirely  on  costs  that  are  generated  randomly  and  uniformly. 

Kennington  and  Wang  [75]  developed  a  parallel  version  of  the  shortest  augmenting  path 
(SAP)  code  of  Jonker  and  Volgenant  [67],  which  they  tested  on  the  Symmetry  S81,  with  up  to 
ten  Intel  80386  cpus.  They  report  solutions  to  dense  1200  x  1200  assignment  problems  with 
cost  range  [0  -  1000]  in  approximately  15  seconds  and  cost  range  [0  -  10000]  in  an  average 
of  under  20  seconds.  They  also  report  that  the  auction  algorithm  did  not  achieve  results 
comparable  to  the  shortest  augmenting  path  in  a  serial  implementation,  and  hence  it  was  not 
parallelized.  Zaki  [125]  continued  this  study  on  an  Alliant  FX/8,  parallelizing  and  vectorizing 
both  algorithms.  He  confirmed  the  observations  of  Kennington  and  Wang,  for  certain  problem 
categories.  However,  his  results  show  that  the  auction  algorithm  achieves  much  better  speedups 
than  the  SAP  code  and  also  vectorizes  very  well.  As  a  result,  the  auction  algorithm  outperforms 
SAP  by  a  large  margin  when  implemented  on  a  vector/parallel  architecture.  He  reports  solutions 
of  2000x2000  problems  with  cost  range  [0-10000]  in  approximately  30  seconds  with  the  auction 
algorithm,  and  2  minutes  with  SAP. 

Kempka, Kennington and Zaki [74] tested the auction algorithm on the Alliant FX/8 without
the $\epsilon$-scaling that typically makes the algorithm computationally effective and more stable.
They report, nonetheless, solutions to a 1000 x 1000 dense problem with cost range [0, 100] in
under one second and a 4000 x 4000 problem in just over half a minute. Since they do not use
scaling, their results are unpredictable: a 1000 x 1000 problem takes 12 seconds for cost range
[0, 1000], while a 2000 x 2000 problem with the same cost range takes over 255 seconds. For
cost range 10000 they achieve an average of 33 seconds for 2000 x 2000 problems.

Balas, Miller, Pekny and Toth [4] implemented a parallel shortest augmenting path algorithm
on the Butterfly Plus computer, with 14 processors. They were able to solve dense
1000  x  1000  problems  with  cost  range  [0  -  1000]  in  an  average  of  9.39  seconds,  and  cost  range 
[0  -  10000]  in  11.70  seconds.  They  also  solved  dense  2000  x  2000  problems  with  cost  range 
[0  -  10000]  in  30  seconds,  3000  x  3000  in  a  minute,  and  a  dense  30000  x  30000  problem  with 
900  million  variables  in  less  than  an  hour. 

Bertsekas and Castanon [11] did an extensive study of several variants of the auction algorithm
on  20%  dense  problems  on  the  Encore  Multimax.  They  tested  both  Jacobi  and  Gauss-Seidel 
versions  and  a  block-Jacobi  implementation.  An  interesting  feature  of  this  study  is  that  they 
develop  asynchronous,  as  well  as  synchronous,  parallel  implementations.  The  asynchronous 
algorithm  has  a  substantial  advantage  over  its  synchronous  counterpart.  They  were  able  to 
solve  problems  of  size  1000  x  1000  in  under  10  seconds. 

Castanon, Smith and Wilson [19] studied the effectiveness of different synchronous implementations
of the Gauss-Seidel auction algorithm and the shortest augmenting paths code of
Jonker  and  Volgenant  [67]  for  solving  both  dense  and  sparse  assignment  problems  on  a  variety 
of  architectures.  They  demonstrated  speedups  of  up  to  60  for  the  Gauss-Seidel  implementation 
of  the  auction  algorithm  for  problems  of  size  1000  x  1000. 

There  are  two  studies  of  Connection  Machine  algorithms  that  are  relevant  to  our  work. 
Cindy  Phillips  wrote  the  first  version  of  the  Jacobi  auction  algorithm  on  the  Connection  Machine 
[95, 96]. She developed most of the important details of the Jacobi implementation;
however, she was not able to test the code thoroughly and solved only one example for most cost
ranges.

Eckstein did an empirical study of the alternating direction method of multipliers for general
linear  cost  sparse  networks  on  the  Connection  Machine  CM-2  [32,  31].  The  computational 
results  were  encouraging  for  sparse  assignment  problems. 

There has been a variety of recent related work on Connection Machine algorithms for network
optimization. We have already mentioned that Goldberg implemented a fast maximum
flow algorithm [45]. Zenios and Lasken [127] solved nonlinear network problems; Zenios [126]
developed algorithms for multicommodity transportation problems; Nielsen and Zenios [128]
developed algorithms for stochastic network optimization models arising in financial applications.
Eckstein [32, 31] has extensively studied the alternating step method for transportation
problems. As stated above, his results for linear cost networks are not encouraging, but the
results for sparse quadratic transportation problems are quite good and are competitive with
the massively parallel row-action algorithm of Zenios and Censor [20]. Furthermore, his method
appears to be able to solve problems with mixed linear and quadratic objective terms with
little additional difficulty.


7.3  Algorithms  for  the  Assignment  Problem 

The assignment problem is to find the minimum-weight perfect matching in a bipartite graph.
At times in this chapter we will phrase the problem in terms of n people and n objects, with a
benefit $a_{ij}$ associated with the assignment of object i to person j. We seek an assignment A of
objects to people (A(i) = j means that object i is assigned to person j) so that $\sum_{i} a_{iA(i)}$ is
maximized. In a globally optimal solution any given person may not be assigned to his most
valuable object. However, for a globally optimal assignment it is possible to assign a price $\pi_i$
to each object i, so that if each person j views the profit associated with object i as $a_{ij} - \pi_i$
then each person is assigned to his most profitable object. This fact can be understood as a
consequence of linear programming duality.

7.3.1  The  Auction  Algorithm 

The auction algorithm finds the optimum assignment by finding such prices for all the objects.
It produces an assignment A and prices $\pi_i$ such that

$$a_{iA(i)} - \pi_i \ge \max_{j=1,\ldots,n} \left( a_{jA(i)} - \pi_j \right).$$

The algorithm starts with each object assigned an arbitrary price; prices are adjusted upwards
as people bid for their most profitable object. Each iteration of the algorithm consists of one
or several currently unassigned people choosing the object that is most profitable to them and
submitting a “bid” on the object. Each object that has been bid upon is assigned to the highest
bidder, adjusts its price to the bid, and deassigns the person to whom it was previously assigned
(if anyone).

Formally, each person bidding calculates the profits $p_{\mathrm{best}}$ and $p_{\mathrm{next}}$ associated with his two
most profitable objects, and then bids $\pi_i + p_{\mathrm{best}} - p_{\mathrm{next}}$ on his best object i. A bid-upon object
then is assigned to the highest bidder and sets its price to that bid.

Epsilon  Scaling 

Epsilon-scaling is used in the auction algorithm in order to improve upon the worst-case time
bounds and computational behavior. We relax the condition that a person bids upon and is
assigned to his favorite object:

$$a_{iA(i)} - \pi_i \ge \max_{j=1,\ldots,n} \left( a_{jA(i)} - \pi_j \right).$$

Instead we merely require that

$$a_{iA(i)} - \pi_i \ge \max_{j=1,\ldots,n} \left( a_{jA(i)} - \pi_j \right) - \epsilon.$$

Such an assignment is called $\epsilon$-optimal. The algorithm runs in a series of phases, each
phase taking an $\epsilon$-optimal assignment and returning a set of prices and an assignment that is
$\epsilon' = (\epsilon/c)$-optimal, for some user-specified constant c. Bertsekas proved that if an assignment
is $\epsilon$-optimal with $\epsilon < 1/n$ and the $a_{ij}$ are all integers, then the assignment is globally optimal
[9]. Define $M_a = \max_{ij} a_{ij}$. Since any assignment is $M_a$-optimal, we can start with $\epsilon = M_a$ and
in $O(\log(M_a n))$ phases we will produce an assignment that is $(1/n)$-optimal and thus globally
optimal. Each phase of the algorithm is a mini-auction as described in the previous section,
except that instead of bidding $p_{\mathrm{best}} - p_{\mathrm{next}}$ a person can bid $p_{\mathrm{best}} - p_{\mathrm{next}} + \epsilon$. Each phase
produces an $\epsilon$-optimal assignment, and by successively lowering $\epsilon$ in this fashion we obtain an
optimal assignment. The worst-case sequential complexity of the algorithm on dense problems
is $O(n^3 \log(M_a n))$, and there is no known proof of worst-case parallel speedup. A summary
of the algorithm is as follows.

Step 0 (Initialization) $\epsilon \leftarrow \max_{ij} a_{ij}$, $\pi_i \leftarrow 0$.

Step 1 Auction:

1.1 Some subset of the set of unassigned people determine and bid on their favorite
object.

1.2 Each bid-upon object determines its highest bidder, raises its price to that bid, and
assigns itself to that person, deassigning the person to whom it was previously
assigned (if anyone).

1.3 If any person is unassigned goto 1.1.

Step 2 If $\epsilon < 1/n$ stop; else $\epsilon \leftarrow \epsilon/2$. People whose assignments are no longer $\epsilon$-optimal deassign
themselves.

Step 3 Goto Step 1.
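The summary above can be illustrated with a minimal sequential sketch in Python. It is not the code used in this study. It assumes integer benefits stored as `a[i][j]` (object i, person j), lets one person bid at a time, halves $\epsilon$ between phases, and, as a simplification, lets every person re-enter the auction at the start of each phase instead of keeping the assignments that are still $\epsilon$-optimal. The names `auction` and `total_benefit` are illustrative only.

```python
def auction(a):
    """Return (assigned, prices) for an n x n integer benefit matrix.

    a[i][j] is the benefit of giving object i to person j, assigned[j] is
    the object held by person j, and prices[i] is the final price of object i.
    """
    n = len(a)
    prices = [0.0] * n
    # Step 0: start with epsilon = largest benefit (at least 1).
    eps = float(max(1, max(max(row) for row in a)))
    while True:
        # Simplification: everybody is unassigned at the start of a phase.
        owner = [None] * n       # owner[i] = person holding object i
        assigned = [None] * n    # assigned[j] = object held by person j
        unassigned = list(range(n))
        while unassigned:                          # Step 1.3
            j = unassigned.pop()
            best, p_best, p_next = -1, float("-inf"), float("-inf")
            for i in range(n):                     # Step 1.1
                profit = a[i][j] - prices[i]
                if profit > p_best:
                    best, p_best, p_next = i, profit, p_best
                elif profit > p_next:
                    p_next = profit
            if p_next == float("-inf"):            # only one object exists
                p_next = p_best
            prices[best] += p_best - p_next + eps  # Step 1.2: price rises to the bid
            if owner[best] is not None:            # previous holder is deassigned
                assigned[owner[best]] = None
                unassigned.append(owner[best])
            owner[best] = j
            assigned[j] = best
        if eps < 1.0 / n:                          # Step 2
            return assigned, prices
        eps /= 2.0


def total_benefit(a, assigned):
    """Sum of the benefits a[i][j] over all pairs (object i, person j)."""
    return sum(a[i][j] for j, i in enumerate(assigned))
```

Because the last phase runs with $\epsilon < 1/n$ and the benefits are integers, the returned assignment is optimal by the result of Bertsekas cited above.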

Jacobi  vs.  Gauss-Seidel 

Note  that  in  the  most  general  form  of  the  algorithm  any  subset  of  those  people  unassigned  can 
bid  simultaneously.  The  two  traditional  parallel  variants  of  the  auction  algorithm  are  the  Jacobi 
version  and  the  Gauss-Seidel  version.  By  the  Jacobi  version  we  refer  to  an  algorithm  in  which 
all  unassigned  people  bid  simultaneously  on  their  favorite  objects  before  the  prices  are  adjusted, 
whereas  in  Gauss-Seidel  only  one  person  bids  at  a  time.  Since  in  the  Gauss-Seidel  version  each 
bid  takes  advantage  of  the  updated  price  information  of  the  previous  bids,  it  usually  takes 
fewer  total  bids  to  produce  an  optimal  assignment.  The  Jacobi  method,  however,  has  greater 
potential  for  a  massively  parallel  implementation. 

In  general  the  terms  “Jacobi”  and  “Gauss-Seidel”  are  not  used  solely  with  regard  to  the 
auction  algorithm.  “Jacobi”  is  used  to  refer  to  a  method  where  the  prices  (dual  variables)  at 
time  t  + 1  are  updated  only  with  respect  to  the  information  at  time  t ,  whereas  a  “Gauss-Seidel” 
iteration  updates  a  dual  variable  with  respect  to  the  most  recent  information.  Thus  a  Jacobi 
method  allows  the  updating  of  all  prices  in  parallel,  whereas  a  Gauss-Seidel  method  allows  two 
prices  to  be  updated  in  parallel  only  if  the  update  of  one  does  not  depend  on  the  relevant  values 
of  the  other.  This  Jacobi/Gauss-Seidel  distinction  is  one  of  the  primary  differences  between  the 
two methods of multipliers we discuss.

7.3.2 The Method of Multipliers Algorithm

The  method  of  multipliers  (MOM)  is  a  general  method  for  a  variety  of  problems  in  convex 
programming  [57, 98];  we  summarize  here  a  specialization  of  the  method  to  assignment  problems 
suggested  in  [12]  and  we  refer  the  reader  there  for  a  full  development  and  explanation  of  the 
algorithm.  Since  the  methods  of  multipliers  we  describe  here  are  specializations  of  multiplier 
methods  for  linear  programs,  we  first  formulate  the  assignment  problem  as  a  linear  program, 
and  without  loss  of  generality  as  a  minimization  problem.  We  let  E  denote  the  set  of  edges  of 
the  network;  in  our  dense  case  E  contains  an  edge  (i,j)  between  every  person  i  and  object  j. 

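$$\text{minimize} \quad \sum_{(i,j) \in E} a_{ij} f_{ij}$$

subject to

$$\sum_{\{j \mid (i,j) \in E\}} f_{ij} = 1 \quad \forall i = 1,\ldots,n,$$

$$\sum_{\{i \mid (i,j) \in E\}} f_{ij} = 1 \quad \forall j = 1,\ldots,n,$$

$$0 \le f_{ij} \le 1 \quad \forall (i,j) \in E.$$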

We assign dual prices $r_i$ and $p_j$ to the equality constraints. The method of multipliers for
this problem results in a Gauss-Seidel iteration to minimize the Augmented Lagrangian. The
iterative step has the form

Step 0 $f_{ij} \leftarrow 0$, $r_i \leftarrow 0$, $p_j \leftarrow 0$.

Step 1 Update the values of f as follows

$$f_{ij} = \left[ f_{ij} + \frac{1}{2c(t)} \left( r_i(t) + p_j(t) - a_{ij} + c(t) \left( y_i(t) + w_j(t) \right) \right) \right]^+ \quad \forall \text{ edges } (i,j),$$

where $y_i(t)$ and $w_j(t)$ are given in terms of $f_{ij}(t)$ by

$$y_i(t) = 1 - \sum_{\{j \mid (i,j) \in E\}} f_{ij}(t) \quad \forall i = 1,\ldots,n,$$

$$w_j(t) = 1 - \sum_{\{i \mid (i,j) \in E\}} f_{ij}(t) \quad \forall j = 1,\ldots,n,$$

c(t) is a nondecreasing sequence of positive constants and $[x]^+$ indicates the projection
of x onto [0, 1].

Step 2 At the end of the minimization yielding $f_{ij}(t+1)$, $y_i(t+1)$ and $w_j(t+1)$ for all i, j,
update the prices $r_i$ and $p_j$ according to

$$r_i(t+1) = r_i(t) + c(t)\, y_i(t+1) \quad \forall i = 1,\ldots,n,$$

$$p_j(t+1) = p_j(t) + c(t)\, w_j(t+1) \quad \forall j = 1,\ldots,n.$$

Step 3 Check for convergence; if iteration has not converged goto Step 1.
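As an illustration of Steps 1 and 2, the following Python sketch performs one Gauss-Seidel pass over all edges of a dense problem, followed by the price update. It assumes the coordinate update with coefficient 1/(2c), which is what minimizing the augmented Lagrangian in a single variable $f_{ij}$ gives; a full Step 1 would repeat such passes until the minimization has converged. The function names and the row-by-row sweep order are assumptions made for the example, not part of the original implementation.

```python
def clip01(x):
    """Projection of x onto the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def mom_sweep(a, f, r, p, c):
    """One Gauss-Seidel pass over all edges (i, j); f is updated in place.

    Returns the residuals y (per row) and w (per column) after the pass.
    """
    n = len(a)
    y = [1.0 - sum(f[i]) for i in range(n)]
    w = [1.0 - sum(f[i][j] for i in range(n)) for j in range(n)]
    for i in range(n):
        for j in range(n):
            new = clip01(f[i][j]
                         + (r[i] + p[j] - a[i][j] + c * (y[i] + w[j])) / (2.0 * c))
            delta = new - f[i][j]
            f[i][j] = new
            # Keep the residuals current so the next update sees fresh values.
            y[i] -= delta
            w[j] -= delta
    return y, w


def mom_price_update(r, p, y, w, c):
    """Step 2: move each price by c times the residual of its constraint."""
    return ([ri + c * yi for ri, yi in zip(r, y)],
            [pj + c * wj for pj, wj in zip(p, w)])
```

On the 2 x 2 cost matrix [[0, 1], [1, 0]] with all variables initially zero and c = 1, a single pass already reaches the flow [[1, 0], [0, 1]] with zero residuals, so the prices do not move.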

7.3.3  The  Alternating  Direction  Method  of  Multipliers 

The alternating direction method of multipliers (ADMOM), due to Eckstein [32], takes a Jacobi
approach to updating the augmented Lagrangian. This method as well has a variety of applications
to convex programming [12, 32, 31, 33, 34]. The application of this method to the
assignment problem results in a Jacobi-type algorithm which is more suitable for massively
parallel computation than the method of multipliers. The iteration proceeds as follows.

Step 0 $f_{ij} \leftarrow 0$, $r_i \leftarrow 0$, $p_j \leftarrow 0$.

Step 1 Update the $f_{ij}$ as follows

$$f_{ij}(t+1) = \left[ f_{ij}(t) + \frac{1}{2c} \left( r_i(t) + p_j(t) - a_{ij} + c \left( y_i(t) + w_j(t) \right) \right) \right]^+ \quad \forall \text{ edges } (i,j),$$

$$r_i(t+1) = r_i(t) + c\, y_i(t+1) \quad \forall i = 1,\ldots,n,$$

$$p_j(t+1) = p_j(t) + c\, w_j(t+1) \quad \forall j = 1,\ldots,n,$$

where $y_i(t)$ and $w_j(t)$ are given in terms of $f_{ij}(t)$ by

$$y_i(t) = \frac{1}{d_i} \Bigl( 1 - \sum_{\{j \mid (i,j) \in E\}} f_{ij}(t) \Bigr) \quad \forall i = 1,\ldots,n,$$

$$w_j(t) = \frac{1}{d_j} \Bigl( 1 - \sum_{\{i \mid (i,j) \in E\}} f_{ij}(t) \Bigr) \quad \forall j = 1,\ldots,n,$$

$d_i$ and $d_j$ denote the number of arcs incident to the corresponding node, and c is a constant.

Step 2 Check for convergence; if iteration has not converged goto Step 1.
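A single ADMOM iteration on a dense n x n problem can be sketched in Python as follows. The sketch assumes the update coefficient 1/(2c), by analogy with the method of multipliers, and residuals divided by the node degree n; both are read from a partly illegible source and should be checked against [32]. The name `admom_step` is illustrative.

```python
def admom_step(a, f, r, p, c):
    """One Jacobi-style ADMOM iteration; returns new (f, r, p).

    a[i][j] is the cost of edge (i, j); every f[i][j] is updated from the
    values at time t only, so all n*n updates are independent.
    """
    n = len(a)

    def residuals(flow):
        # Dense case: every node has n incident arcs.
        y = [(1.0 - sum(flow[i])) / n for i in range(n)]
        w = [(1.0 - sum(flow[i][j] for i in range(n))) / n for j in range(n)]
        return y, w

    y, w = residuals(f)
    f_new = [[min(1.0, max(0.0,
                           f[i][j] + (r[i] + p[j] - a[i][j]
                                      + c * (y[i] + w[j])) / (2.0 * c)))
              for j in range(n)]
             for i in range(n)]
    # Prices use the residuals of the new flows, y(t+1) and w(t+1).
    y_new, w_new = residuals(f_new)
    r_new = [r[i] + c * y_new[i] for i in range(n)]
    p_new = [p[j] + c * w_new[j] for j in range(n)]
    return f_new, r_new, p_new
```

With the cost matrix [[0, 1], [1, 0]], zero initial values and c = 1, the first iteration gives flows of 0.5 on the two zero-cost edges and moves every price by only 0.25, which illustrates how division by the node degree slows the price changes.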

Note that in ADMOM all $f_{ij}$ can be updated simultaneously while MOM relies on y and
w being up to date, and therefore only one value $f_{ij}$ can be updated in each row and column
at a time. Therefore we can only do n updates at a time for MOM while we can do $n^2$ for
ADMOM; however, due to the Gauss-Seidel nature of the minimizing iteration in MOM, we
would expect MOM to converge much more quickly. Furthermore the price
updates for ADMOM are divided by the number of arcs incident to that node, which for dense
problems is n. Therefore for large dense problems the prices will only change a small amount
in each iteration and therefore the number of iterations has the potential to be large. Note that
neither of these algorithms has been proved to be a polynomial-time algorithm, although they
have been proved to converge to the correct answer in a finite amount of time.

7.4 Designs for Massively Parallel Implementation

In this section we discuss the issues involved in efficiently implementing the algorithms of the
previous section on a Connection Machine CM-2.

7.4.1  The  Connection  Machine  CM-2 

The  Connection  Machine  CM-2  is  a  massively  parallel  computer  with  up  to  65,536  processors. 
Each  processor  has  a  single-bit  processing  unit  and  64K  or  256K  bits  of  local  RAM.  The 
processors  run  in  SIMD  mode  and  are  connected  in  an  n-cube  topology.  The  system  software 
provides  global  maximum  operations  as  well  as  scan  and  spread  operations  that  are  parallel 
prefix operations [14]. The CM-2 uses a front end such as a SUN-4, VAX or Symbolics Lisp
Machine. Parallel extensions to the programming languages LISP, C and FORTRAN, via the
front-end,  allow  the  user  to  program  the  Connection  Machine  and  the  front-end  system.  For 
further  information  see  [26]  and  [58]. 

A  Connection  Machine  can  emulate  a  large  number  of  processors  by  having  each  physical 
processor  simulate  a  number  of  virtual  processors.  The  ratio  of  the  number  of  virtual  processors 
to  the  number  of  physical  processors  is  referred  to  as  the  virtual  processor  ratio ,  or  vp-ratio. 
Using standard Gray coding the processors of the CM can be configured as a k-dimensional
grid; to represent an n x n assignment problem we configure the CM as an N x N grid, where N
is n rounded up to the nearest power of two. Row i is associated with object i and column j is
associated with person j. In particular, processor (i,j) stores the value $a_{ij}$ of object i to person
j, local variables applicable to person j such as the most profitable object to that person, and
local variables applicable to object i, such as its price. A specified number of virtual processors,
along  with  a  configuration  of  the  machine,  is  called  a  vp-set,  and  it  is  possible  to  use  several 
different  vp-sets  in  the  same  computation  and  to  switch  between  them  when  desired. 

A  mapping  of  virtual  processors  in  a  grid  to  the  physical  processors  of  the  CM  is  known 
as  a  geometry.  The  user  has  the  freedom  to  dictate  a  variety  of  the  features  of  this  mapping. 
In particular we exploit the ability to specify which axes or directions of the grid should have
more physical processors representing them, and which should have more virtual processors.
In other words, we can specify that virtual processors that are adjacent along one axis are on
different physical processors, and that processors that are adjacent on another axis are on the
same  physical  processor.  The  mapping  of  the  N  x  N  grid  onto  the  physical  CM  will  differ 
among  our  implementations,  and  we  will  use  alternate  representations  as  well,  but  this  is  the 
basic  representation  common  to  all  the  codes. 

7.4.2  Massively  Parallel  Implementations  of  the  Auction  Algorithm 

In  this  section  we  discuss  the  implementation  of  the  Jacobi  and  Gauss-Seidel  variants  of  the 
auction  algorithm,  together  with  a  new  version  that  we  refer  to  as  a  “hybrid”  algorithm. 

The  Jacobi  Algorithm 

In the implementation of the Jacobi² algorithm we have one N x N vp-set which is used both to
store  the  data  and  for  computation.  It  is  mapped  onto  the  physical  CM  processors  in  the  default 
fashion,  which  balances  the  two  axes  so  that  they  have  a  comparable  physical  processor/virtual 
processor  makeup.  A  detailed  summary  of  the  Jacobi  implementation  is  as  follows. 

Step 0 We keep a copy of $\epsilon$ in each processor. Using a global maximum operation and a
broadcast, set $\epsilon \leftarrow (n+1) M_a$. In each processor set $\pi \leftarrow 0$. We keep two boolean
variables in each processor: assigned-here, which is True in processor (i,j) if object i is
assigned to person j, and person-assigned, which is True in all processors of column j
if person j is assigned to an object. In all processors set person-assigned $\leftarrow$ False and
assigned-here $\leftarrow$ False. Also scale all values of a by n + 1.

Step 1 Determine if everyone is assigned (a global or operation of person-assigned). If a
person is unassigned proceed to Step 2. If every person is assigned to an object and $\epsilon < 1$
the algorithm terminates. If $\epsilon \ge 1$ reduce its value, deassign everyone whose current
assignment is no longer $\epsilon$-optimal for the new $\epsilon$, and proceed to Step 2.

Step 2 Select all the processors associated with unassigned people. Set profit $\leftarrow a - \pi$.
Within each column j find the row index $i_j$ of the most profitable object by forming
the concatenation of the profit and column number j in each processor and doing a grid
spread-with-max. Set best $\leftarrow i_j$. Turning off processor (best, j) do another grid
spread-with-max to find the profit p of the next-best object and set next-best $\leftarrow p$. Person j
bids on object best by setting the variable bid in processor (best, j). The value of the
bid is computed as follows: Let $w_{\mathrm{best},j}$ be the maximum profit from all objects except
best. The bid from j to best is $a_{\mathrm{best},j} - w_{\mathrm{best},j} + \epsilon$.

Step 3 Using a grid spread-with-max along the rows, determine the maximum bid on each
object and update the prices $\pi$ within the columns. For all objects bid upon, assign the
object to the highest bidder by setting assigned-here. Update person-assigned using
or-grid-scans within the columns. Go to Step 1.

² Preliminary experiences with this algorithm have already been reported on in [95]; it is included here for
completeness and in order to present additional computational results. The description of the algorithm is
adapted from that paper.

Number of Bidders   Fraction of Iterations
1                   38.60%
1-10                82.35%
1-100               95.95%
1-500               98.78%

Table 7.1: Statistics on Number of Bidders active
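The bidding and assignment of Steps 2 and 3 can be sketched sequentially in Python. This is only an illustration of the arithmetic: the grid spread-with-max operations are replaced by ordinary loops, and the names `jacobi_round`, `prices` and `owner` are assumptions made for the example. As in the text, `a[i][j]` is the value of object i to person j, and the bid is the value of the best object minus the next-best profit plus $\epsilon$.

```python
def jacobi_round(a, prices, owner, eps):
    """One Jacobi round: all unassigned people bid against the same prices.

    prices[i] is the price of object i and owner[i] the person holding it
    (or None). Both lists are updated in place. Returns the number of bidders.
    """
    n = len(a)
    holders = {j for j in owner if j is not None}
    bids = {}          # object -> (highest bid, bidder)
    bidders = 0
    for j in range(n):
        if j in holders:
            continue
        bidders += 1
        profits = [a[i][j] - prices[i] for i in range(n)]
        best = max(range(n), key=profits.__getitem__)
        next_best = max((profits[i] for i in range(n) if i != best),
                        default=profits[best])
        bid = a[best][j] - next_best + eps
        if best not in bids or bid > bids[best][0]:
            bids[best] = (bid, j)
    # Step 3: each bid-upon object goes to its highest bidder at that bid.
    for i, (bid, j) in bids.items():
        prices[i] = bid
        owner[i] = j           # any previous owner is thereby deassigned
    return bidders
```

For the values [[10, 9], [1, 2]] with $\epsilon = 1$, both people bid on object 0 in the first round (bids 10 and 8), person 0 wins, and person 1 then bids 4 on object 1 in a second round with a single active bidder.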

The  CM  has  the  ability  to  potentially  perform  thousands  of  bids  at  once;  therefore,  this 
seems  to  be  a  very  attractive  method  for  this  architecture.  However,  this  approach  leads  to 
a  large  sequential  tail.  Most  people  are  assigned  to  objects  very  quickly,  and  the  bulk  of  the 
computational  time  is  spent  in  establishing  the  last  few  assignments,  hence  greatly  reducing  the 
amount  of  parallelism.  Table  7.1  gives  a  typical  breakdown  of  processor  utilization  encountered 
in  a  1000  x  1000  problem  with  cost  range  [0  -  1000]. 

Two  optimizations  were  implemented  to  minimize  this  tail  effect.  The  first  was  to  optimize 
the  case  when  only  one  bidder  was  active:  the  preferred  object  for  that  sole  bidder  was  found  by 
a  global  maximum  over  all  the  processors  as  opposed  to  the  more  powerful  but  slower  max-scan 
operation.  The  latter  was  unnecessary  for  the  case  of  one  bidder. 

The second optimization was the truncation of the tail: instead of running each phase until
a complete $\epsilon$-optimal assignment was achieved, the auction was terminated when k% of the
people had been matched. k increases as epsilon decreases, so that k = 100 in the last phase.
This optimization resulted in substantial speedups. An implementation that initially matches
only 80%, and matches progressively more in successive phases, does an average of 5.1 times
fewer iterations than an implementation that completes each phase, on uniformly randomly
generated problems of size 128-256. (See Table 7.2.)

Problem Size   Maximum Cost   Speedup (its.)   Speedup (Time)
128            100            5.62             4.61
128            1000           4.30             3.70
256            100            3.23             3.28
256            1000           7.37             6.35

Table 7.2: Improvements in iterations and time gained by truncation of the tail. Each entry
is an average over five randomly generated examples.

The  Gauss-Seidel  Algorithm 

In this version, where we do one bid at a time, we must continue to store the n x n problem
on the CM, which will necessarily be configured at a high vp-ratio. We would like, however, to
perform the computations at a lower vp-ratio, and therefore more quickly, on a grid of reduced
dimension. We introduce a mapping of the N x N grid to the physical machine that will enable
us to extract the information necessary for one bid, and compute the bid in a vp-set with
vp-ratio 1. We do this in a fashion that requires no communication between the two vp-sets.

We define the geometry of the N x N vp-set, which we will call data-vp-set, so that the y
axis is physical. Processors (i,j) and (i,k) are on different physical processors for all j ≠ k.
All of the virtual bits are along the x axis; processors (0,j), (1,j), ..., (vp-ratio - 1, j) will all
be mapped to the same physical processor, as will all processors (q,j), where q ∈ [vp-ratio*k,
vp-ratio*(k+1) - 1]. We define a second vp-set, compute-vp-set, of vp-ratio 1 that has N rows
and (Number of Physical CM processors)/N columns. For example, if N = 1024 and we are
working on a Connection Machine with 16,384 processors, compute-vp-set will be a 1024 x 16
vp-set (1024 rows, 16 columns). A processor in compute-vp-set shares the same physical processor
with 64 processors in the data-vp-set. Note that an entire column in data-vp-set shares the same
physical processors with one of the columns of compute-vp-set. We will transfer the information
we need for one bid from data-vp-set to compute-vp-set by having a processor in compute-vp-set
just point to the relevant data in data-vp-set. See Figure 7.1.

Figure 7.1: The two vp-sets for the Gauss-Seidel algorithm, compute-vp-set (16 columns) and
data-vp-set (1024 columns, one per person), for a 1024 x 1024 problem on a 16K CM. Each
vertical block of data-vp-set represents 64 virtual columns that all reside on the same physical
column; for example, column 2 of compute-vp-set is mapped to the same physical CM processors
as columns 128-191 of data-vp-set. Only 8 blocks are portrayed here.

To execute the bid for the ith person we select his column $c_{\mathrm{data}}$ in data-vp-set, and identify
the column in compute-vp-set with which it is coincident: $c_{\mathrm{compute}} = \lfloor c_{\mathrm{data}} / \text{vp-ratio} \rfloor$. Each parallel
variable has a pointer to the physical location where its data resides; we simply change the
pointer of the parallel variable in compute-vp-set to point at the data in the data-vp-set. For
example, in our 1024 x 1024 problem on a 16,384 processor machine, suppose person 303 is
bidding. All the relevant information resides on the same physical processors as does column 4
of the compute-vp-set. Pointers are changed so that column 4 points to the data of column 303
in data-vp-set, and the bid is carried out. Note that the price information need only be stored
in compute-vp-set. The Gauss-Seidel algorithm is thus as follows:

Step  1  Pick  an  unassigned  person  and  calculate  with  which  physical  column  he  is  associated; 
set  the  appropriate  pvars  in  compute-vp-set  to  point  to  his  values.  (Need  to  change  only 
one  pointer  per  pvar.) 

Step  2  In  compute-vp-set,  calculate  his  bid.  (This  requires  two  global  maximum  computations: 
one  to  calculate  the  best  object  and  one  for  the  next  best.) 

Step  3  Update  the  price  information,  which  only  need  be  kept  in  compute-vp-set. 

Step  4  Goto  step  1. 
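The index arithmetic behind Step 1 can be checked with a few lines of Python. The helper names are hypothetical, and the sketch only reproduces the numbers quoted in the text (a 1024 x 1024 grid on 16,384 physical processors); it does not model the pointer manipulation on the CM itself.

```python
def next_power_of_two(n):
    """Smallest power of two that is at least n (the grid side N)."""
    side = 1
    while side < n:
        side *= 2
    return side


def vp_ratio(side, physical_processors):
    """Virtual processors per physical processor for a side x side grid."""
    return (side * side) // physical_processors


def compute_column(person, ratio):
    """Column of compute-vp-set that shares hardware with this person's column."""
    return person // ratio


def data_columns(compute_col, ratio):
    """Columns of data-vp-set that reside on one column of compute-vp-set."""
    return range(compute_col * ratio, (compute_col + 1) * ratio)
```

For N = 1024 and 16,384 processors the vp-ratio is 64, person 303 maps to column 4 of compute-vp-set, and column 2 of compute-vp-set covers data columns 128 through 191, in agreement with the example above and with Figure 7.1.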

In  contrast  to  the  Jacobi  implementation,  where  all  information  is  kept  on  the  CM,  it  is 
more  efficient  here  to  keep  track  of  who  is  assigned,  and  to  what  object,  in  front-end  lists  and 
arrays. This avoids significant amounts of communication. In fact we utilize a copy of the $a_{ij}$
that  is  kept  on  the  front  end  as  well  to  calculate  the  bid  and  in  this  way  avoid  CM  to  front-end 
communication  time. 

Note that from the end of one phase to the start of the next, when epsilon is decreased
to $\epsilon'$, often many of the $\epsilon$-optimal assignments from the previous phase are $\epsilon'$-optimal as well,
and need not be recomputed in the next phase. In the Jacobi code these assignments are
identified and preserved for the next phase; this is easy with a Jacobi representation since each
assignment can be checked in parallel. In the Gauss-Seidel case, when this information is stored
on the front end, it is a sequential computation to determine who is $\epsilon'$-optimal and thus we do
not check. Not preserving these assignments does increase the total number of Gauss-Seidel
bids, since we are throwing away information, but the gain in computation time outweighed
the increase in iterations; thus we chose not to preserve them. We tested our Gauss-Seidel
implementation against a sequential Gauss-Seidel implementation that we obtained from D.P.
Bertsekas on problems of size 64 - 256 and found that the number of bids they performed was
comparable.

The Hybrid Jacobi/Gauss-Seidel Algorithm

The  Jacobi  code  is  very  efficient  when  a  large  number  of  people  are  unassigned  and  are  bidding 
and  the  Gauss-Seidel  code  makes  effective  use  of  the  machine  when  there  are  few  active  bidders. 
It  can  do  one  bid  at  a  time  very  quickly  and  also  decreases  the  total  number  of  bids  needed. 
These  characterizations  suggest  that  a  better  use  of  the  CM  is  a  hybrid  combination  of  these 
approaches.  Execute  the  Jacobi  algorithm  early  in  the  phase  when  large  numbers  of  people 
are  unassigned.  When  most  are  assigned,  switch  to  Gauss-Seidel.  Note  that  the  deterministic 
algorithm  with  the  best  worst-case  running  time  bound  0*(n$  log(nA/„))  uses  a  very  similar 
two-phase  approach  with  the  second  phase  being  a  shortest  paths  algorithm  [46].  Ahuja  and 
Orlin  devised  a  sequential  two-phase  auction  algorithm  with  an  improved  running  time  over  a 
standard auction algorithm [1].

Making the transition from Jacobi to Gauss-Seidel will require a certain amount of communication
of the data during a phase, since the way that the grids are mapped onto the
Connection  Machine  is  different  for  Jacobi  and  Gauss-Seidel.  This  cost  is  far  outweighed  by 
the  gains  in  computation  time. 

The hybrid algorithm uses two thresholds, thresh1 and thresh2. thresh1 determines in
which phases we employ both methods; thresh2 determines when in the phase we switch to
Gauss-Seidel. If a phase is aiming to assign k% of the people and k% > thresh1, then we switch
to Gauss-Seidel when thresh2 × k% of the people are assigned. Computational testing showed that
for problems of size 1000 x 1000 and 2000 x 2000 with uniformly randomly generated costs the best
setting of these parameters, although different for different cost ranges, was thresh1 = thresh2 = 97
or 98. Interestingly, for both problems this is close to the $\sqrt{n}$ turning point used theoretically
in [1].
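
The switching rule can be sketched in a few lines. This is only an illustration of one reading of the rule: the function names are hypothetical, both thresholds are treated as integer percentages, and a phase whose target does not exceed thresh1 is assumed to use Jacobi bidding throughout.

```python
def switch_point(n, k_percent, thresh1=97, thresh2=97):
    """Number of assigned people at which a phase switches to Gauss-Seidel.

    Returns None when the phase target k_percent does not exceed thresh1;
    in that case the phase is assumed to stay with Jacobi bidding.
    Percentages are integers so that the arithmetic stays exact.
    """
    if k_percent <= thresh1:
        return None
    # thresh2 percent of k percent of n, rounded down
    return (n * thresh2 * k_percent) // 10000


def bidding_mode(n, assigned, k_percent, thresh1=97, thresh2=97):
    """Return which kind of bid the hybrid code would perform next."""
    point = switch_point(n, k_percent, thresh1, thresh2)
    if point is not None and assigned >= point:
        return "gauss-seidel"
    return "jacobi"
```

For example, with n = 1000, a phase aiming at 100% and both thresholds at 97, this sketch switches to Gauss-Seidel once 970 people are assigned, while a phase aiming at 80% never switches.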

7.4.3  Massively  Parallel  Implementations  of  the  Methods  of  Multipliers 

In contrast to the auction algorithm, which only has the potential to do n bids at a time, MOM
always does n updates at a time and ADMOM does $n^2$. Therefore these methods are very
attractive for parallel computation. The description of the Connection Machine implementations
of these algorithms is relatively straightforward given our previous discussion. We configure the
Connection Machine as an N x N geometry, with equal preference given to the physical makeup
of each axis (as we did with the pure Jacobi auction algorithm). Each processor (i,j) stores the
current value of $f_{ij}$, $p_i$, $r_j$, $y_i$, and $w_j$. For the ADMOM algorithm, we need merely to do one
spread-with-add operation along each axis in order to calculate $\sum_{\{j \mid (i,j) \in E\}} f_{ij}$, $\forall i = 1, \ldots, n$,
and $\sum_{\{i \mid (i,j) \in E\}} f_{ij}$, $\forall j = 1, \ldots, n$; then all that is required is several arithmetic operations that
all happen within each processor with no further communication required. For the MOM
algorithm we must select groups of n processors such that only one processor is selected in each row
and column. The strategy that we use is to select processors (i,j) such that $i + j \equiv k \pmod{n}$,
and loop over k = 0, ..., n - 1.

Size   Cost   Jacobi Bids   Jacobi Iterations   Average Parallelism   GS Bids
128    100    3837          147                 26.49                 1567
128    1000   4114          215                 20.77                 2099
256    100    10210         813                 14.56                 5102
256    1000   9054          264                 40.56                 6244

Table 7.3: Jacobi vs. Gauss-Seidel Statistics
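
The processor-selection rule described above for MOM (all (i,j) with i + j congruent to k modulo n) can be checked with a small sequential sketch. The function names are hypothetical and the code only illustrates the index pattern, not the Connection Machine implementation.

```python
def diagonal_groups(n):
    """Yield, for k = 0..n-1, the processors (i, j) with (i + j) mod n == k."""
    for k in range(n):
        yield [(i, (k - i) % n) for i in range(n)]


def is_permutation_group(group, n):
    """True if the group picks exactly one processor in each row and column."""
    rows = sorted(i for i, _ in group)
    cols = sorted(j for _, j in group)
    return rows == list(range(n)) and cols == list(range(n))
```

Each group touches every row and every column exactly once, and the n groups together cover all $n^2$ processors, so looping over k visits every (i,j) pair once.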

7.5  Computational  Results  and  Discussion 

In this section we discuss the performance of the two algorithms on dense problems. All the
problem data was produced by generating integer costs randomly and uniformly using the
Connection Machine random number generator. This is a very specialized distribution and
problems drawn from this distribution are generally understood to be easy, in that most of the
lower weight edges are not necessary. In fact, one can usually solve these problems by removing
75% of the edges, those with smaller costs, and solving the remaining problem. If the result is
not optimal it can usually be patched up quickly [92].

Despite these considerations, we believe that computational results on this distribution are
still meaningful, for several reasons. First of all, all computational studies which we know of
test only on this distribution; therefore testing such problem instances provides some ability to
compare different implementations and architectures. Secondly, removing the “bottom” 75%
of the arcs is not particularly useful on the Connection Machine. Although it would allow
the machine to be configured at a lower vp-ratio, the resulting sparse communication pattern
would be significantly slower than that available when one has a fully dense
problem [31]. Finally, despite the fact that these problems are understood to be easy, they
seem not to be easy at all for a massively parallel architecture due to the tail phenomenon we
have discussed. Their “easiness” may in some sense cause this difficulty, since most of the
assignments are easy to compute and it is the computation of the last few that takes a great
deal of time. Nonetheless, it is important to understand if a massively parallel architecture can
achieve good results on these sorts of instances.

Size   Cost    Jacobi Time    Gauss-Seidel Time   Hybrid Time
1024   100     197.5/106.2    585.5/124.8         105.1/25.5
1024   1000    414.2/219.5    242.7/46.5          58.6/22.1
1024   10000   276.7/146.0    222.2/44.3          92.3/20.8

Table 7.4: Running times of the algorithms, in seconds, averaged over five random examples.
Each entry is given as total time/CM time.

In  the  next  section  we  discuss  computational  results  on  instances  drawn  from  different 
distributions. 

7.5.1  The  Performance  of  the  Auction  Algorithm 

The  auction  algorithms  were  initially  implemented  in  a  combination  of  *Lisp  and  Lisp/Paris 
and  were  run  with  a  SUN4  front  end.  A  comparison  of  the  number  of  bids  done  by  the  Jacobi 
and  Gauss-Seidel  codes  on  problems  of  moderate  size  is  given  in  Table  7.3.  We  see  that  the 
Gauss-Seidel  code  can  do  significantly  fewer  bids  than  Jacobi,  by  as  much  as  a  factor  of  two 
or  greater.  We  also  see  that,  as  expected,  the  average  number  of  people  bidding  in  the  Jacobi 
code  is  much  less  than  n. 
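
As a quick arithmetic check of the "factor of two" claim, the ratio of Jacobi bids to Gauss-Seidel bids can be computed directly from the bid counts reported in Table 7.3. The variable and function names below are illustrative only; the numbers are copied from the table in row order.

```python
# (cost range, Jacobi bids, Gauss-Seidel bids), in the row order of Table 7.3
TABLE_7_3_BIDS = [
    (100, 3837, 1567),
    (1000, 4114, 2099),
    (100, 10210, 5102),
    (1000, 9054, 6244),
]


def bid_ratio(jacobi_bids, gs_bids):
    """How many times more bids the Jacobi code performed than Gauss-Seidel."""
    return jacobi_bids / gs_bids


ratios = [round(bid_ratio(j, g), 2) for _, j, g in TABLE_7_3_BIDS]
```

The resulting ratios range from roughly 1.45 to 2.45, which agrees with the statement that Gauss-Seidel can save a factor of two or more in the number of bids on some instances.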

Table  7.4  gives  data  on  the  running  times  of  the  algorithms  on  fully  dense  problems  of  size 
n  =  1024,  on  a  16,384  processor  CM2.  Each  running  time  is  given  in  seconds  and  is  the  average 
of  five  problems.  The  Hybrid  algorithm  is  faster  than  both  the  Jacobi  and  Gauss-Seidel  codes 
in  all  cost  ranges;  this  was  true  as  well  for  a  number  of  smaller  problem  sizes.  Further,  this 
also held true for a variety of settings of the parameters of the code, such as the factor by which we
divide  epsilon  in  each  iteration,  the  initial  percentage  matched,  etc.  This  led  us  to  believe  that 
the  superiority  of  the  hybrid  approach  was  fairly  robust  with  respect  to  modest  modifications 
to  the  algorithms. 

The  parameter  settings  that  proved  to  work  best  are  80%  for  the  initial  percent  to  be 
matched  and  a  factor  of  2  to  divide  epsilon,  except  for  the  Gauss-Seidel  code  where  a  factor  of 
4  seemed  superior.  In  joint  work  with  Vu  Lephan  that  is  reported  on  in  [120],  we  discovered 
that  the  same  parameter  settings  were  the  best  for  a  sparse  implementation  of  a  Jacobi  auction 
algorithm,  over  a  wide  range  of  problem  sizes  and  degrees  of  sparsity. 

It  is  interesting  and  not  intuitive  that  the  Gauss-Seidel  code  would  often  do  as  well  or  better 
than  the  Jacobi  code,  especially  when  one  considers  Connection  Machine  time  alone.  However, 
one must take into account the fact that one Gauss-Seidel bid is 70 - 80 times faster
(in Connection Machine time) than a Jacobi parallel bid on a 1024 x 1024 problem running
on  a  16K  machine.  The  total  times  are  significantly  larger  than  the  CM  times.  This  partially 
reflects  the  strategy  of  letting  the  front  end  execute  as  much  of  the  inherently  sequential  part  of 
the  problem  as  possible,  but  mostly  reflects  the  Lisp  front  end  code.  Upon  recoding  in  C/Paris 
this  discrepancy  basically  disappeared. 

Based  on  this  testing  we  chose  the  hybrid  algorithm  as  the  most  successful  and  recoded  it 
in  C/Paris.  The  C  front  end  code  runs  much  faster,  and  this  led  to  significant  improvements  in 
the  overall  running  times;  the  discrepancy  between  front  end  times  and  CM  time  became  very 
small.  In  Tables  7.5  and  7.6  we  give  results  on  the  performance  of  the  algorithms  on  both  16K 
and  32K  machines,  for  problems  of  size  1000-2000,  over  various  cost  ranges.  Both  the  total 
number  of  iterations  and  the  number  of  Jacobi  iterations  are  reported.  Each  number  is  the 
average  of  ten  randomly  generated  examples;  we  used  different  sets  of  examples  for  the  16K 
and  32K  machine  in  order  to  give  some  idea  of  the  variation  possible  in  algorithm  performance 
over  two  similar  sets  of  problems  of  the  same  size  and  cost  range. 

The  number  of  Jacobi  iterations  is  surprisingly  small,  but  during  these  iterations  hundreds 
of  bids  are  carried  out  at  once.  Another  interesting  fact  is  that  for  a  specific  problem  size  and 
cost  range  the  number  of  Jacobi  iterations  is  always  about  the  same;  for  almost  all  size  and 
cost  ranges  it  was  never  more  than  five  away  from  the  average. 



CHAPTER  7.  PARALLEL  ALGORITHMS  FOR  THE  ASSIGNMENT  PROBLEM 


16K machine:

Size   Cost     Time   Total Iterations   Jacobi Iterations
1000   100      34.6   16252              128
1000   1000     22.2   13637              119
1000   10000    15.4   5031               141
1000   100000   15.3   4923               144

32K machine:

Size   Cost     Time   Total Iterations   Jacobi Iterations
1000   100      28.7   21850              126
1000   1000     17.5   10141              120
1000   10000    8.1    2000               141
1000   100000   9.8    3578               144

Table 7.5: Running times of the C/Paris hybrid auction algorithm on 1000 x 1000 problems.
The top table is for a 16K machine, the bottom table for a 32K machine. Time is in seconds,
averaged over ten random examples. 5 of the 20 examples with cost range [1 - 100] tested ran
for more than 100000 iterations, and their running times are not included.


We  note  that  the  cost  range  100  for  size  1000  problems  is  particularly  difficult  for  our 
implementation;  this  is  reflected  both  in  the  increased  number  of  iterations  and  in  the  fact 
that  on  approximately  25%  of  the  random  instances  we  generated  the  code  ran  for  more  than 
100,000 iterations, although it always terminated. After initial testing we ran each code only
up to 100,000 iterations and thus those examples of size 1000 and cost 100 that ran longer are
not  included  in  the  averages. 

We also note that given our vp-ratio 1 implementation of a Gauss-Seidel bid, the time for
one bid is not dependent on problem size or machine size as long as n is smaller than the size
of the machine. Our code averages 1000 Gauss-Seidel bids per second on both a 16K and 32K
CM-2. This of course is not the case for the Jacobi phase, e.g. one Jacobi parallel bid for a
1000 x 1000 problem on a 16K CM-2 takes 0.07 seconds, whereas on a 32K machine one parallel
bid takes 0.04 seconds. Therefore, as bigger problems are solved and the vp-ratio increases, the
factor by which the hybrid approach outperforms the Jacobi approach will increase. We were
limited to 2000 x 2000 problems only because of memory constraints of the Connection Machine
we were using. Currently Connection Machines exist with a factor of 16 more memory than the
machine to which we had access; on such a machine with 16,384 processors we would be able to
solve 8000 x 8000 problems; with a fully-configured machine with 65,536 processors we should
be able to solve a 16000 x 16000 problem.

16K machine:

Size   Cost     Time   Total Iterations   Jacobi Iterations
2000   100      60.2   2817               242
2000   1000     96.3   60936              148
2000   10000    60.4   23657              152
2000   100000   57.9   18004              156

32K machine:

Size   Cost     Time   Total Iterations   Jacobi Iterations
2000   100      33.5   2368               241
2000   1000     50.3   22955              150
2000   10000    45.3   26739              151
2000   100000   36.3   15263              157

Table 7.6: Running times of the C/Paris hybrid auction algorithm on 2000 x 2000 problems.
The top table is for a 16K machine, the bottom table for a 32K machine. Time is in seconds,
averaged over ten random examples.
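
The per-bid timings quoted above for the C/Paris code suggest a rough break-even point between the two kinds of bid. The sketch below is a simplified model, not part of the reported implementation: it compares only raw bid throughput and ignores both the reduction in total bids that Gauss-Seidel provides and the cost of remapping data when switching. All names are hypothetical; times are kept in integer milliseconds so the arithmetic is exact.

```python
GS_BIDS_PER_SECOND = 1000  # reported rate on both 16K and 32K CM-2

# Reported time of one Jacobi parallel bid on a 1000 x 1000 problem, in ms
JACOBI_MS_PER_PARALLEL_BID = {"16K": 70, "32K": 40}


def break_even_bidders(machine):
    """Sequential Gauss-Seidel bids that fit in one Jacobi parallel bid."""
    jacobi_ms = JACOBI_MS_PER_PARALLEL_BID[machine]
    return jacobi_ms * GS_BIDS_PER_SECOND // 1000


def faster_mode(active_bidders, machine):
    """Which mode processes the currently active bidders in less time."""
    if active_bidders > break_even_bidders(machine):
        return "jacobi"
    return "gauss-seidel"
```

Under this model a Jacobi parallel bid on the 16K machine pays off only when more than about 70 people bid at once (about 40 on the 32K machine), which is consistent with switching to Gauss-Seidel for the tail of each phase.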

7.5.2  The  Performance  of  the  Multiplier  Methods 

We began by testing the methods of multipliers on small problems in order to understand their
behavior. The performance of MOM and ADMOM on randomly generated problems of size
64 x 64 and 256 x 256 is recorded in Tables 7.7, 7.8, and 7.9. In Table 7.7 the entry (x, y) means
that ADMOM ran for an average of x iterations before converging, and that it converged within
10000 iterations for y of the 10 examples. We imposed an arbitrary cutoff here of 10000 iterations.
An asterisk indicates that the method never converged with that parameter setting on that
cost range; an X means that we did not test that combination fully since initial testing indicated
it would not converge. Based on our experience with ADMOM and some preliminary testing
of MOM we chose a reduced set of values of c on which to carefully test MOM. Table 7.8 gives the
results; here we imposed a larger arbitrary cutoff of 20000 iterations, since due to the
Gauss-Seidel nature of the algorithm the number of iterations is expected to be larger. Ten random
examples were considered for each cost range; again, * should be interpreted as meaning none
of the random examples converged.
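
To make the (x, y) notation concrete, the following sketch shows how such an entry could be derived from the raw iteration counts of a batch of runs. The function name and the convention that None marks a run stopped at the cutoff are assumptions made for this illustration; the sample data in the usage note is invented, not taken from the experiments.

```python
def summarize_runs(iteration_counts, cutoff=10000):
    """Summarize a batch of runs as (average iterations, number converged).

    iteration_counts holds one entry per example: the iteration count at
    convergence, or None if the run was stopped at the cutoff.
    Returns "*" when no example converged within the cutoff.
    """
    converged = [c for c in iteration_counts if c is not None and c <= cutoff]
    if not converged:
        return "*"
    average = sum(converged) / len(converged)
    return (average, len(converged))
```

For instance, ten hypothetical runs of which two converge after 900 and 1100 iterations would be reported as (1000.0, 2), and a batch in which nothing converges is reported as *.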

The tests on the size 64 problems indicate that indeed the number of minimizations of the
augmented Lagrangian is much smaller for MOM than for ADMOM, but the tests on size 256
problems, given in Table 7.9, indicate that the total number of iterations of MOM gets
unmanageable for size 256 problems, since one minimization requires 256 iterations of the inner loop. A
further feature of MOM is that fairly frequently it returns non-integral but very close to optimal
solutions; this behavior was not observed with ADMOM. The number (out of 10) of solutions
that were integral is recorded in Table 7.8 as well.

Cost     c=1        c=5         c=10        c=25        c=100       c=500
100      (1000,2)   (400,2)     (400,2)     (550,2)     (1400,2)    (5150,2)
1000     (8130,3)   (2880,10)   (1540,10)   (690,10)    (620,10)    (3450,8)
10000    *          *           (9800,1)    (6040,10)   (1740,10)   (730,10)
100000   X          X           X           X           (7750,10)   (2720,10)

Table 7.7: Performance of ADMOM on dense problems of size 64. Each combination of cost
and c was tested on 10 randomly generated problems. The entry (x, y) denotes an average of x
iterations on the y of 10 problems that converged within 10000 iterations. A * indicates that none
of the tested examples in that category converged within the limit, and an X indicates that that
category was not tested fully since preliminary testing indicated it was sure not to converge.

Both  methods  were  highly  sensitive  to  choice  of  c,  and  different  values  of  c  were  preferable 
for  different  cost  ranges. 

Given  this  preliminary  data  we  deemed  it  unnecessary  to  test  MOM  on  1000  x  1000  problems, 
due  to  its  disappointing  behavior  on  smaller  ones.  We  did  test  ADMOM  on  dense  problems 
of size 1000 x 1000, for c equal to each of 25, 100, 500, 1000. We tested all 4 cost ranges with
each  c,  on  five  randomly  generated  examples.  None  of  the  examples  converged  within  10000 
iterations,  which  was  approximately  2  minutes  of  time  on  a  32K  CM-2.  We  thus  conclude  that 
for  problems  of  this  size  MOM  and  ADMOM  are  inferior  to  a  hybrid  auction  approach  and  in 
general  not  practical  ways  of  solving  large  dense  problems. 

7.5.3 Comparisons With Other Codes on the Uniform Distribution

In Table 7.10 we compare our computational results with those reported by [4], [74] and [75] on
problems with uniformly randomly generated costs. Our algorithm seems to be comparable or superior
at cost ranges where the maximum cost is at least 10 times the problem size, and compares
particularly poorly in small cost ranges. To the best of our knowledge most of the
computational studies on the auction algorithm started from a well-developed and tested code authored
by Dimitri Bertsekas and Paul Tseng, and thus incorporate their experience in testing this
algorithm. Due to the different architecture on which we were working it was not possible
to directly adapt this code, and we were interested largely in the issues involved in creating
an efficient massively parallel implementation of this algorithm. We believe that some of the
heuristics developed by Bertsekas and Tseng could be incorporated into our code in modified
form and potentially significantly improve the number of iterations required.

Cost     c=25        # Integral   c=100        # Integral   c=500        # Integral
100      (9850,10)   0            (18430,1)    0            (13300,9)    0
1000     (5766,10)   7            (11460,9)    5            (16410,10)   1
10000    (7360,10)   10           (3827,10)    9            (5887,10)    4
100000   *           0            (12460,9)    9            (4530,10)    8

Table 7.8: Performance of MOM on dense problems of size 64 of various cost ranges.

Cost     c=25 ADMOM   c=25 MOM    c=100 ADMOM   c=100 MOM   c=500 ADMOM   c=500 MOM
1000     (26.5,2)     *           (58.5,2)      *           (196,1)       *
10000    (86,7)       (181.7,2)   (39.9,7)      *           (67.8,8)      *
100000   *            *           *             *           (57.7,10)     *

Table 7.9: Data on 256 x 256 problems. We imposed an arbitrary cutoff of 20000 iterations.
For each value of c, the first column is ADMOM and the second is MOM.


7.6  Other  Input  Distributions 

We also tested our hybrid auction algorithm on dense problems with costs generated from a
Cauchy distribution. For each edge we randomly and uniformly chose an integral $x \in [0, 1000]$,
and set the cost of the edge to be $C(x)$, where

$$C(x) = \frac{b_1}{1 + \frac{(x - 500)^2}{b_2^2}}.$$

n = 1000:

Cost     Hybrid   Balas et al.   Kempka Kennington Zaki   Zaki    Kennington Wang
100      28.7     2.01           0.758                    0.56    5.22
1000     17.5     9.39           12.083                   12.49   8.27
10000    8.1      11.70          11.48                    11.98   11.34
100000   9.8      -              -                        -       13.61

n = 2000:

Cost     Hybrid   Balas et al.   Kempka Kennington Zaki   Zaki    Kennington Wang
100      33.5     5.52           3.813                    1.55    -
1000     50.3     23.20          255.56                   257.2   -
10000    45.3     30.09          32.96                    32.8    -
100000   36.3     -              -                        -       -

Table 7.10: Comparison of the hybrid auction code on a 32K machine with other parallel
codes. Times are given in seconds; the top table is n = 1000 and the bottom is n = 2000. We
compare with the Jacobi auction code results of Kempka, Kennington and Zaki, the
Gauss-Seidel auction code results of Zaki, and two SAP codes. These codes are discussed in the
introduction. An entry with a - indicates those researchers do not report results for that
range.


We tested dense problems of size n = 128, 256 and 512. For each of these we tested three
different settings of the $b_i$: (1) $b_1 = 10000$, $b_2 = 500$, (2) $b_1 = 10000$, $b_2 = 100$ and (3) $b_1 =
1000$, $b_2 = 500$. We ran the hybrid algorithm with thresh1 = thresh2 = $\sqrt{n}$, and compared
it to the Jacobi algorithm. For each setting of the $b_i$ and each problem size we generated ten
examples. We also ran the hybrid code on problems with uniformly generated costs in the same
cost ranges, for the sake of comparison.
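
A test instance of this kind could be generated along the following lines. This is a sketch under a stated assumption: it takes the cost function to have the Cauchy-like shape $C(x) = b_1 / (1 + (x - 500)^2 / b_2^2)$, which is one reading of the formula above, and it leaves the costs as floating-point values. The function names and the seeding scheme are hypothetical.

```python
import random


def cauchy_cost(x, b1, b2, center=500):
    """Cauchy-shaped cost (assumed form): peak b1 at x = center, width b2."""
    return b1 / (1.0 + ((x - center) ** 2) / (b2 ** 2))


def cauchy_cost_matrix(n, b1, b2, seed=0):
    """Dense n x n cost matrix; each edge draws an integer x uniformly in [0, 1000]."""
    rng = random.Random(seed)
    return [
        [cauchy_cost(rng.randint(0, 1000), b1, b2) for _ in range(n)]
        for _ in range(n)
    ]
```

With setting (1), $b_1 = 10000$ and $b_2 = 500$, this form gives costs between 5000 (at x = 0 or x = 1000) and 10000 (at x = 500).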

The  results  of  these  experiments  are  essentially  identical  to  those  from  experiments  on  the 
uniform distribution. Since we ran these experiments on an 8192-processor Connection Machine,
we ran somewhat smaller problems and cannot compare our results directly, but (1) the running
times  of  the  hybrid  algorithm  on  the  Cauchy  distribution  were  comparable  to  those  on  the 
uniform  distribution,  and  (2)  the  qualitative  behavior  that  led  to  the  success  of  the  hybrid 
algorithm,  namely  the  large  sequential  tail,  was  observed  as  well.  Average  speedups  observed 
on  the  512  x  512  problems  were  in  the  range  of  5  -  10. 
7.7 Other Possible Approaches

In  this  section  we  discuss  several  other  possible  approaches  to  designing  Connection  Machine 
algorithms  for  dense  assignment  problems. 

Matching  by  Matrix  Inversion 

From a theoretical perspective, the fastest algorithm for the assignment problem is the
randomized algorithm of Mulmuley, Vazirani and Vazirani [90]. This algorithm is quite simple
conceptually, requiring only the calculation of a matrix inverse and a matrix determinant.
Unfortunately, in order to guarantee with probability 1/2 that the algorithm will succeed the edge
weights must be scaled up by a factor of $n^3$. Since the entries of the matrix to be inverted are
exponential in the edge weights, this computation is impossible for the cost ranges we consider.
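
A back-of-the-envelope calculation shows why these matrix entries are unmanageable. The sketch assumes entries of the form $2^w$ for an edge of (scaled) weight w and ignores the small random perturbation added to each weight, so it gives only an order of magnitude. The function name is hypothetical.

```python
def entry_bits(max_weight, n, scale_by_n_cubed=True):
    """Bits needed to store 2**w for the largest (optionally scaled) weight w."""
    w = max_weight * n ** 3 if scale_by_n_cubed else max_weight
    return w + 1  # the integer 2**w occupies exactly w + 1 bits


# Sanity check of the bit count on a small exponent
assert (2 ** 100).bit_length() == entry_bits(100, 1, scale_by_n_cubed=False)
```

For n = 1000 and a cost range of only 100, a single scaled entry would need about $10^{11}$ bits, and even without scaling a cost range of 100000 requires entries of roughly 100000 bits each.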

When  the  costs  of  the  edges  are  generated  randomly  and  uniformly  in  a  large  enough  cost 
range,  the  techniques  of  [90]  indicate  that  even  if  the  edge  weights  are  not  scaled  up  further  the 
algorithm will succeed with high probability. Even the unscaled edge weights, however, lead to
unmanageably  large  matrix  entries.  We  experimented  with  computing  only  with  the  first  digit 
or  two  of  the  edge  weights;  these  computations  yielded  close  approximations  to  the  optimal 
assignment  value  but  never  in  our  experiments  yielded  the  optimum  value.  Therefore  we  could 
find  no  way  to  convert  the  algorithm  of  [90]  into  a  reasonable  Connection  Machine  algorithm. 

Shortest  Augmenting  Paths 

In  its  most  basic  form  the  shortest  augmenting  paths  algorithm  solves  the  assignment  problem 
by  finding  n  shortest  paths.  Each  of  these  shortest  paths  computations  would  require  at  least  a 
few  spread  operations;  several  thousand  spread  operations  would  be  much  slower  than  a  hybrid 
auction  algorithm  on  a  1000  x  1000  or  2000  x  2000  problem. 

A  modified  form  of  the  shortest  paths  approach  involves  an  initial  auction-like  phase  that 
accomplishes  a  large  number  of  assignments;  the  last  assignments  are  established  by  shortest 
paths  computations  [67].  Such  an  approach  has  potential  to  be  competitive  with  the  hybrid 
auction  algorithm;  however,  the  fact  remains  that  the  shortest  paths  computations  require 
spread operations at a high vp-ratio, in contrast to the hybrid algorithm which processes the
tail with global maximum operations at vp-ratio 1. Since these can be up to a factor of 100
faster  than  a  spread  for  the  problem  and  machine  sizes  we  have  discussed,  it  is  very  unlikely 
that  a  shortest-paths  based  approach  can  outperform  the  hybrid  auction  algorithm. 

Network  Simplex 

It  is  the  general  consensus  of  the  computational  optimization  community  that  the  network 
simplex  algorithm  is  inferior  to  both  the  auction  algorithm  and  the  shortest  paths  algorithm 
as  a  sequential  algorithm  for  the  assignment  problem  [75].  It  may  be  possible  to  do  one  pivot 
quickly  in  parallel,  in  which  case  the  performance  of  the  algorithm  depends  on  the  number  of 
iterations.  To  do  one  pivot  quickly  in  parallel  one  would  have  to  be  able  to  quickly  find  the 
path  between  two  vertices  of  a  tree;  the  tree  only  changes  by  one  arc  from  iteration  to  iteration. 
Doing O(n) or even O(log n) spreads to find this path would be too slow, since spreads at high
vp-ratios  are  slow  in  comparison.  Therefore  just  using  a  standard  path-finding  algorithm  would 
be  ineffective.  We  do  not  see  how  to  do  this  in  a  constant  number  of  Connection  Machine 
operations,  although  we  have  not  studied  the  problem  in  great  depth. 

7.8  Conclusions 

We  have  seen  that  for  large  dense  problems  the  two  methods  of  multipliers  take  good  advantage 
of  the  massive  parallelism  of  the  Connection  Machine,  but  are  not  computationally  effective 
methods  for  dense  problems  due  to  the  large  number  of  iterations  required  for  convergence.  The 
auction  algorithm  is  more  difficult  to  implement  efficiently  on  the  Connection  Machine,  but  we 
have  presented  several  methods  to  achieve  an  implementation  of  a  variant  of  this  algorithm 
that  is  competitive  with  the  best  MIMD  algorithms  on  large  problems. 

Massive parallelism is best exploited by algorithms which use little complex communication
and have a huge amount of data to process at every moment. Our experiences with the
auction  algorithm  as  a  method  to  solve  the  assignment  problem  indicate  that  it  does  not  fall  into 
this  paradigm.  This  is  not  an  isolated  phenomenon,  but  appears  in  a  number  of  combinatorial 
approaches  to  optimization  problems  [121].  We  have,  however,  established  several  methods 
that  can  bring  combinatorial  techniques  closer  to  fast  implementations  on  massively  parallel 
architectures. It would seem that currently the best approach is to utilize massive parallelism
while the problem continues to be massively parallel, and then to switch to another technique
that is better suited to the tail of the problem. We have yet, however, to produce CM techniques
for this problem that are significantly better than those implemented on smaller scale MIMD
machines. It is a challenging open problem to more fully exploit massive parallelism in this